Google has released DiffusionGemma, an experimental open model that abandons the token-by-token generation approach used by every major LLM today. Instead of predicting one token at a time, it drafts an entire block of 256 tokens simultaneously and refines them through iterative passes. This is the same core idea that powers image generators like Stable Diffusion, now applied to text at production scale.

DiffusionGemma is an experimental open model that explores text diffusion, an exceptionally fast approach to text generation. Released under an Apache 2.0 license, this 26B Mixture of Experts (MoE) model moves beyond the sequential token-by-token processing of typical autoregressive LLMs, generating entire blocks of text simultaneously and delivering up to 4x faster text generation on GPUs.

The problem it's solving

Local inference has always had a dirty secret: your GPU is mostly idle. In the cloud, sequential generation is efficient because servers can batch thousands of user requests together to share the hardware load. But when run locally for a single user, this word-by-word process leaves your dedicated GPU underutilized, since it spends most of its time simply waiting for the next "keystroke."

Because DiffusionGemma processes the full block in parallel, the weights are loaded once per refinement pass and applied across 256 tokens simultaneously. This shifts the inference bottleneck from memory bandwidth to raw computational throughput, turning the memory wall into a non-problem for single-user workloads. The result is a fundamentally different hardware utilization profile, one that happens to be ideal for local, low-concurrency deployments.
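To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It counts only how many full weight reads are needed per generated token and ignores everything else (activations, caches, MoE routing). The block size of 256 comes from the description above; the figure of 32 refinement passes and the function name are hypothetical, chosen purely for illustration.

```python
def weight_loads_per_token(block_size, refinement_passes):
    """Full weight reads needed per generated token.

    Autoregressive decoding is the special case of a one-token block
    produced in a single pass.
    """
    return refinement_passes / block_size


autoregressive = weight_loads_per_token(block_size=1, refinement_passes=1)

# 256 is the block size quoted in the article.
# 32 passes is an ASSUMED number, not a published DiffusionGemma setting.
block_diffusion = weight_loads_per_token(block_size=256, refinement_passes=32)

print(autoregressive)                    # 1.0
print(block_diffusion)                   # 0.125
print(autoregressive / block_diffusion)  # 8.0
```

The ratio is a reduction in weight traffic, not a measured speedup. Each refinement pass also does far more arithmetic than a single-token step, which is exactly why the bottleneck moves from memory bandwidth to compute.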

How text diffusion actually works

Text diffusion, the approach behind discrete diffusion language models (dLLMs), is a generation paradigm borrowed from image synthesis. Instead of committing to tokens left-to-right, the model starts with noise and refines its way to coherence. DiffusionGemma uses a specific variant called Uniform State Diffusion. Here's the process in three steps:

  1. Canvas initialization: The model starts with a block of random placeholder tokens, essentially a blank canvas of noise.
  2. Iterative denoising: The model makes multiple forward passes, locking in high-confidence tokens and using them as context to refine the remaining uncertain ones.
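
The canvas initialization and iterative denoising steps above can be sketched as a toy loop. This is not DiffusionGemma's code or API: `toy_model`, the tiny vocabulary, the fixed `TARGET` sentence, and the confidence threshold are all hypothetical stand-ins for a real forward pass. The sketch only shows the control flow of drafting every position at once, locking the confident ones, and redrafting the rest.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "and", "slept"]
# ASSUMPTION: a fixed sentence the toy "model" tends to converge to.
TARGET = ["the", "cat", "sat", "on", "a", "mat"]


def toy_model(canvas, locked, rng):
    """Hypothetical stand-in for one forward pass over the whole block.

    Returns a (token, confidence) guess for every position. Confidence
    rises as more of the block is locked, mimicking better context.
    """
    context = sum(locked) / len(canvas)
    guesses = []
    for i in range(len(canvas)):
        confidence = min(1.0, rng.random() * 0.6 + context * 0.6)
        token = TARGET[i] if rng.random() < confidence else rng.choice(VOCAB)
        guesses.append((token, confidence))
    return guesses


def denoise(threshold=0.5, max_passes=32, seed=0):
    rng = random.Random(seed)
    size = len(TARGET)
    # Step 1: canvas initialization with random tokens (noise).
    canvas = [rng.choice(VOCAB) for _ in range(size)]
    locked = [False] * size
    passes = 0
    # Step 2: iterative denoising until every position is locked.
    while not all(locked) and passes < max_passes:
        guesses = toy_model(canvas, locked, rng)
        open_positions = [i for i in range(size) if not locked[i]]
        # Always lock the most confident open position so each pass progresses.
        best = max(open_positions, key=lambda i: guesses[i][1])
        for i in open_positions:
            token, confidence = guesses[i]
            canvas[i] = token  # unlocked positions are redrafted every pass
            if confidence >= threshold or i == best:
                locked[i] = True
        passes += 1
    return canvas, passes


canvas, passes = denoise()
print(passes, canvas)
```

Because at least one position is locked per pass, the loop never needs more passes than there are tokens in the block. Any speedup in this style of decoding comes from typically needing far fewer.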