DiffusionGemma is Google DeepMind's first open-weights diffusion language model (dLLM), and it flips the fundamental assumption behind how LLMs generate text. Instead of producing one token at a time, it generates and refines an entire 256-token block in parallel. The result is a model that hits speeds no autoregressive system can match on a single GPU, at the cost of some accuracy on standard benchmarks.

DiffusionGemma builds on the 26B A4B Mixture-of-Experts Gemma 4 architecture and generates tokens using discrete diffusion. The open-weights model is multimodal: it accepts text, image, and video inputs and produces text output. It was released under Apache 2.0, with weights available on Hugging Face at google/diffusiongemma-26B-A4B-it.

The bottleneck nobody was talking about

Every GPU-based LLM deployment has a dirty secret: the GPU is mostly waiting. Standard autoregressive models generate one token per forward pass, which means the GPU must reload the model's weights from memory for each token. At small batch sizes, exactly the scenario for local or interactive apps, the GPU's compute cores sit idle while memory bandwidth does all the work.
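To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the "A4B" in the model name means roughly 4 billion parameters are active per token, that weights are stored in 16 bits, and that the GPU streams memory at a round 1 TB/s. All three figures are illustrative assumptions, not measurements or hardware specs, and the function name is made up for this example.

```python
def decode_ceiling_tokens_per_s(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on tokens/s when each token needs one full read of the active weights."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Assumed, illustrative inputs (not taken from any spec sheet):
ACTIVE_PARAMS = 4e9    # ~4B active parameters per token (assumption)
BYTES_PER_PARAM = 2    # 16-bit weights (assumption)
BANDWIDTH = 1e12       # 1 TB/s of memory bandwidth (assumption)

ceiling = decode_ceiling_tokens_per_s(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"bandwidth-bound ceiling: {ceiling:.0f} tokens/s")
```

Under these assumptions the ceiling is set entirely by how fast weights can be streamed, no matter how much arithmetic the GPU could do in the same time. Real numbers also depend on KV-cache reads, quantization, and which experts are touched, none of which this sketch models.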

DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute, generating and refining a 256-token canvas in parallel. By providing the GPU with a large parallel workload, it utilizes tensor cores that would otherwise sit idle during local serving. The payoff is dramatic: 700+ tokens per second on an NVIDIA GeForce RTX 5090 and 1000+ tokens per second on a single NVIDIA H100.
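The idea of filling a whole canvas in a handful of parallel passes can be sketched with a toy sampler. This is a generic confidence-based unmasking loop, assumed here purely for illustration; it is not DiffusionGemma's actual sampling algorithm. The stand-in "model" returns random tokens, and the step count of 8, the vocabulary size, and every function name are hypothetical.

```python
import random

MASK = -1  # hypothetical sentinel for a position that has not been decided yet


def toy_denoiser(canvas, vocab_size, rng):
    """Stand-in for one forward pass: proposes (token, confidence) for every position at once."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in canvas]


def generate_block(block_len=256, steps=8, vocab_size=1000, seed=0):
    """Fill a fully masked canvas in at most `steps` parallel passes (toy illustration)."""
    rng = random.Random(seed)
    canvas = [MASK] * block_len
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, vocab_size, rng)
        passes += 1
        # Commit an equal share of the still-masked positions each step,
        # picking the ones the toy model is most confident about.
        remaining_steps = steps - step
        n_commit = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:n_commit]:
            canvas[i] = proposals[i][0]
    return canvas, passes


canvas, passes = generate_block()
print(f"filled {len(canvas)} positions in {passes} forward passes")
```

With these toy settings the 256 positions are filled in 8 passes, where token-by-token decoding would need 256. Each pass does far more arithmetic per weight read, which is the kind of workload that keeps compute units busy. The sketch only unmasks positions; a real diffusion sampler may also revise tokens it has already placed.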
