Google's DiffusionGemma generates text in blocks, not tokens

Google has open-sourced DiffusionGemma, a 26-billion-parameter mixture-of-experts model that generates text by diffusion, producing 256 tokens per forward pass instead of one at a time.
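Decoding in blocks can be sketched with a toy model that counts forward passes. This is a generic masked-diffusion-style decoder assumed for illustration, not DiffusionGemma's published algorithm. `toy_predictor`, `block_diffusion_decode` and the confidence-based unmasking rule are all hypothetical. Setting `steps_per_block=1` mirrors the article's "256 tokens per forward pass". Diffusion language models often refine a block over several passes instead, and that parameter shows how the pass count changes.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been generated yet

# Hypothetical model interface: one call is one forward pass. Given the
# current sequence (possibly containing MASK), it returns a
# (token, confidence) guess for every position in that sequence.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def toy_predictor(seq: List[Optional[int]]) -> List[Tuple[int, float]]:
    # Deterministic stand-in for a real model: guesses token = position
    # index, with confidence that falls off the further right a slot is.
    return [(i, 1.0 / (1 + i)) for i in range(len(seq))]


def autoregressive_decode(predict: Predictor, n_tokens: int):
    """One forward pass per token, strictly left to right."""
    seq: List[Optional[int]] = []
    passes = 0
    for _ in range(n_tokens):
        preds = predict(seq + [MASK])  # only the new last slot is used
        passes += 1
        seq.append(preds[len(seq)][0])
    return seq, passes


def block_diffusion_decode(predict: Predictor, n_tokens: int,
                           block_size: int = 256, steps_per_block: int = 1):
    """Append a fully masked block, then repeatedly predict every masked slot
    in parallel and commit the most confident ones until the block is full."""
    seq: List[Optional[int]] = []
    passes = 0
    while len(seq) < n_tokens:
        start = len(seq)
        size = min(block_size, n_tokens - start)
        seq.extend([MASK] * size)
        per_step = math.ceil(size / steps_per_block)  # slots committed per pass
        while True:
            masked = [i for i in range(start, len(seq)) if seq[i] is MASK]
            if not masked:
                break
            preds = predict(seq)
            passes += 1
            # Commit the highest-confidence masked positions; the rest stay
            # masked and are re-predicted on the next pass.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:
                seq[i] = preds[i][0]
    return seq, passes


ar_seq, ar_passes = autoregressive_decode(toy_predictor, 1000)
bd_seq, bd_passes = block_diffusion_decode(toy_predictor, 1000)
print(ar_passes, bd_passes)  # 1000 4
```

With `steps_per_block=8`, the same 1,000 tokens take 32 passes in this toy. That is still far fewer calls than autoregressive decoding. Each parallel pass does more work than a single-token step, though, so pass counts are not a direct proxy for the article's speedup figures.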

The speed numbers are concrete: more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 on an RTX 5090, up to 4x faster than autoregressive decoding. The model activates only 3.8 billion parameters and, when quantised, fits in 18GB of VRAM. The catch is buried in Google's own post: autoregressive Gemma 4 "remains the standard for high-quality production outputs." DiffusionGemma trades some quality for lower latency and is aimed at interactive uses such as in-line editing and rapid iteration.
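Here is a back-of-the-envelope check on the memory figure, assuming the weights dominate the footprint. In a mixture-of-experts model every expert typically has to be resident in memory, so VRAM scales with the 26 billion total parameters, while the 3.8 billion active parameters govern compute per token. The article does not say which bit width the 18GB figure uses, so the sketch tries a few. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes).
    Ignores activations, cache and quantisation metadata."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters activated (from the article)
VRAM_BUDGET_GB = 18    # quantised footprint quoted in the article

# Assumed bit widths: the article does not specify the quantisation scheme.
for bits in (16, 8, 4):
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: {gb:.1f} GB, fits in {VRAM_BUDGET_GB} GB: {gb <= VRAM_BUDGET_GB}")

print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The weights alone come to about 52 GB at 16-bit, 26 GB at 8-bit and 13 GB at 4-bit. The 18GB figure is therefore consistent with roughly 4-bit weights plus a few gigabytes for activations and cache, though that reading is an inference and not something the post states. About 14.6% of the parameters are active.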

Diffusion text generation has mostly been a research curiosity. Shipping it under an Apache 2.0 licence with consumer-GPU numbers makes it something developers can actually benchmark against their autoregressive defaults this week.