
Speed has always been Cerebras's pitch. The company built its entire identity around wafer-scale chips, processors the size of an entire silicon wafer, that can run inference at speeds GPU clusters simply cannot match. Until now, that speed advantage was text-only. That changes with the launch of Gemma 4 on Cerebras Inference, which marks the first time the platform supports image inputs, and does so at a pace that redefines what real-time multimodal AI can feel like.
What Just Happened
Gemma 4 is now in private preview on Cerebras Inference, with general availability later this month. It is the first Google DeepMind model Cerebras has brought to the platform, and the first to let developers feed images (screenshots, documents, charts, UI states) into a model running at wafer-scale speed. To celebrate, Cerebras and Google DeepMind are co-hosting a 24-hour hackathon with a $5,000 prize pool, with the top project featured by both companies.
The Number That Matters
Cerebras runs Gemma 4 at over 1,500 output tokens per second. By comparison, Claude Haiku runs at roughly 100 tokens per second. That is a 15x speedup against the most directly comparable production model, at quality that lands in the same band and at a lower price per output token.
That gap is not just a benchmark flex. Speed compounds in exactly the workloads Gemma 4 is built for. Multimodal and agentic loops rarely call a model once: they inspect a visual input, reason over it, produce structured output, call tools, check the result, and try again. At 100 tokens per second, those loops feel sluggish. At 1,500, they feel instant.
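To make the compounding concrete, here is a back-of-the-envelope sketch. The two throughput figures come from the numbers above; the step count and the tokens generated per step are hypothetical values chosen for illustration, and the estimate covers output generation only, ignoring network latency, prompt processing, and tool execution time.

```python
def loop_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Seconds spent generating output tokens across a multi-step loop."""
    return steps * tokens_per_step / tokens_per_second


# Hypothetical workload: six model calls, 500 output tokens each.
STEPS = 6
TOKENS_PER_STEP = 500

slow = loop_seconds(STEPS, TOKENS_PER_STEP, 100)    # ~100 tokens/s
fast = loop_seconds(STEPS, TOKENS_PER_STEP, 1500)   # ~1,500 tokens/s

print(f"{slow:.1f} s at 100 tok/s, {fast:.1f} s at 1,500 tok/s ({slow / fast:.0f}x)")
# prints "30.0 s at 100 tok/s, 2.0 s at 1,500 tok/s (15x)"
```

Under these assumptions, the same loop drops from half a minute of waiting to about two seconds. The ratio stays at 15x whatever the workload, but the absolute time saved grows with every additional step.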

