Google DeepMind's DiffusionGemma Rewrites Text Generation at 1000 Tokens per Second

Google DeepMind

2H AGO

2 min read

LLMS

small_models structured_output

OPEN_SOURCE

Google DeepMind just released DiffusionGemma, an experimental open model that abandons the word-by-word generation paradigm that has defined large language models since GPT-2. Instead of predicting one token at a time, it drafts an entire 256-token block simultaneously and iteratively refines it, delivering up to 4x faster text generation on dedicated GPUs. The weights are available now on Hugging Face under an Apache 2.0 license.

The problem with typewriters

To understand why this matters, you need to understand the bottleneck that plagues local LLM inference. Most language models act like a typewriter, generating one token at a time from left to right. When run locally for a single user, this word-by-word process leaves your dedicated GPU underutilized: it spends most of its time simply waiting for the next "keystroke." The GPU's thousands of compute cores sit mostly idle, starved by memory bandwidth.

DiffusionGemma attacks this inefficiency by shifting the decode bottleneck from memory bandwidth to compute. Instead of predicting words sequentially, it drafts an entire 256-token block simultaneously, giving the processor a larger chunk of work per pass and keeping its compute cores busy. The result: 1000+ tokens per second on a single NVIDIA H100, and 700+ tokens per second on an NVIDIA GeForce RTX 5090.
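A rough back-of-envelope sketch can make the memory-bandwidth argument concrete. It assumes the simplest possible model of decoding: every forward pass has to stream all the weights from memory once, so the number of passes per second is capped by bandwidth divided by weight size. The weight size, the bandwidth, and the 32 refinement passes per block are all hypothetical round numbers chosen for easy arithmetic; they are not DiffusionGemma's published specifications.

```python
def bandwidth_bound_tokens_per_s(weight_bytes, bandwidth_bytes_per_s,
                                 tokens_per_block=1, passes_per_block=1):
    """Ceiling on decode throughput if every forward pass streams all weights once.

    Ignores compute time, KV-cache traffic, and batching, so it is only an
    upper bound for the memory-bound regime.
    """
    passes_per_s = bandwidth_bytes_per_s / weight_bytes
    return passes_per_s * tokens_per_block / passes_per_block


# Hypothetical numbers, chosen for round arithmetic (assumptions, not real specs).
WEIGHT_BYTES = 4_000_000_000      # 4 GB of model weights
BANDWIDTH = 1_000_000_000_000     # 1 TB/s of memory bandwidth

# One token per pass: the "typewriter" case.
autoregressive = bandwidth_bound_tokens_per_s(WEIGHT_BYTES, BANDWIDTH)

# A 256-token block refined over an assumed 32 passes: 8 tokens per pass.
block_diffusion = bandwidth_bound_tokens_per_s(
    WEIGHT_BYTES, BANDWIDTH, tokens_per_block=256, passes_per_block=32)

print(f"autoregressive ceiling:  {autoregressive:.0f} tokens/s")
print(f"block-diffusion ceiling: {block_diffusion:.0f} tokens/s")
```

Under these assumptions the bandwidth ceiling rises in proportion to tokens produced per pass. In practice the block-wise pass also does far more arithmetic, which is why the limiting factor moves from memory bandwidth toward compute.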

How text diffusion actually works

Diffusion language models leverage a method more commonly seen in image generation: starting with random noise and gradually refining it into a coherent output. Applied to text, the process works in three stages:

The canvas: The model starts with a 256-token block filled with random placeholder tokens.
Iterative refinement: Over multiple denoising passes, highly confident tokens get locked in and serve as context clues to resolve adjacent positions.
Self-correction: If a token's confidence drops during a pass, the sampler can re-noise and replace it. Pass by pass, the entire block snaps into focus.
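The three stages above can be sketched as a toy loop. This is a minimal illustration of the control flow only, not DiffusionGemma's sampler: `toy_model` is a hypothetical stand-in that always proposes the correct token and merely varies its confidence with how many neighbours are already resolved, the thresholds are invented, and a `MASK` sentinel replaces the random placeholder tokens for simplicity.

```python
import random

MASK = None  # sentinel for an unresolved position (simplification of "noise")


def toy_model(block, target, rng):
    """Hypothetical denoiser: returns one (token, confidence) pair per position.

    Confidence rises with the number of already-resolved neighbours, which
    mimics locked-in tokens acting as context clues for adjacent positions.
    """
    preds = []
    for i, tok in enumerate(target):
        neighbours = [block[j] for j in (i - 1, i + 1) if 0 <= j < len(block)]
        resolved = sum(n is not MASK for n in neighbours)
        conf = min(1.0, 0.5 + 0.2 * resolved + rng.uniform(-0.2, 0.2))
        preds.append((tok, conf))
    return preds


def denoise(target, lock_threshold=0.6, renoise_threshold=0.35,
            max_passes=50, seed=0):
    """Iteratively refine a fully masked block; returns (block, passes_used)."""
    rng = random.Random(seed)
    block = [MASK] * len(target)              # the canvas
    for n_pass in range(1, max_passes + 1):
        preds = toy_model(block, target, rng)  # one pass scores every position
        for i, (tok, conf) in enumerate(preds):
            if block[i] is MASK and conf >= lock_threshold:
                block[i] = tok                 # lock in a confident token
            elif block[i] is not MASK and conf < renoise_threshold:
                block[i] = MASK                # self-correction: re-noise
        if MASK not in block:
            return block, n_pass
    return block, max_passes


target = "the quick brown fox jumps over the lazy dog".split()
block, passes = denoise(target)
print(" ".join(t if t is not MASK else "_" for t in block), f"({passes} passes)")
```

Note that every position is scored in the same pass, which is the source of the parallelism: the work per pass grows with the block size, while the number of passes does not have to.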

The key architectural insight is bidirectional attention
