Cerebras Runs Google DeepMind's Gemma 4 at 1,500 Tokens per Second

3H AGO

2 min read

Cerebras and Google DeepMind just crossed a milestone together: Gemma 4 31B is now running on Cerebras Inference, making it the first multimodal model on the platform and the first Google DeepMind model Cerebras has ever hosted. To celebrate, they're running a 24-hour virtual hackathon this Sunday with $5,000 in prizes and early API access for participants.

Speed that changes what's possible

The headline number is hard to ignore. Cerebras runs Gemma 4 at over 1,500 output tokens per second. By comparison, Claude Haiku runs at roughly 100 tokens per second. That's a 15x speedup against one of the most popular production-grade models on the market, at roughly equivalent quality.

But raw speed only matters if it changes what you can build. Here it does. Multimodal and agentic loops rarely call a model once: they inspect a visual input, reason over it, produce structured output, call tools, check the result, and try again. At 100 tokens per second, those loops are too slow to feel interactive. At 1,500 tokens per second, the application can keep pace with the user in real time. Front-end iteration feels near-instant, document and screenshot workflows return in a fraction of the time, and developers can fit more verification and more retries into the same latency budget.
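To make the loop argument concrete, here is a rough back-of-the-envelope sketch in Python. The workload (6 sequential model calls of 500 output tokens each) is a hypothetical chosen purely for illustration, and the calculation counts generation time only, ignoring network, prompt processing, and tool latency. The two throughput figures are the ones quoted above.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating output, ignoring network, prefill, and tool latency."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def loop_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total generation time for a sequential agent loop of `steps` model calls."""
    return steps * generation_seconds(tokens_per_step, tokens_per_second)


# Hypothetical workload (an assumption, not a measured benchmark):
# 6 sequential model calls, 500 output tokens each.
STEPS = 6
TOKENS_PER_STEP = 500

slow = loop_seconds(STEPS, TOKENS_PER_STEP, 100)    # ~100 tok/s comparison figure
fast = loop_seconds(STEPS, TOKENS_PER_STEP, 1500)   # ~1,500 tok/s figure cited for Cerebras

print(f"100 tok/s:  {slow:.1f} s")
print(f"1500 tok/s: {fast:.1f} s")
print(f"speedup:    {slow / fast:.0f}x")
```

Under these assumptions the same loop drops from about 30 seconds of generation time to about 2 seconds, which is the difference between a background job and something a user can wait on. Real-world latency will also depend on factors this sketch leaves out.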

What Gemma 4 31B actually is

Gemma 4 31B is the flagship of Google DeepMind's open-weight Gemma family: a dense model (meaning all parameters are active on every forward pass, unlike sparse Mixture-of-Experts models) built for quality and efficiency rather than raw parameter count. A dense model can reach strong quality without the large memory footprint an MoE model needs to hold all of its experts. Gemma 4 hits a sweet spot: strong enough for serious work, efficient to serve, and open enough to build around without vendor lock-in.
