Google DeepMind's Gemma 4 Now Runs a 26B Model on a 16GB Laptop

Google AI Developers

Jun 05, 2026

2 min read

GPUS

edge_deployment quantization

OPEN_SOURCE

Google DeepMind just released Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family, and the numbers are hard to ignore: the flagship 26B model now fits on a 16GB laptop, and the smallest E2B variant runs in under 1GB of memory on a phone. This is not a minor tweak. It is a rethinking of how compression should be done for large language models targeting edge hardware.

The problem with shrinking models the old way

Every time you run a large model locally, you are fighting memory. The standard fix has been Post-Training Quantization (PTQ): take a fully trained model and round its weights down from 16-bit floats to a lower-precision format, such as 4-bit integers, after the fact. Accuracy often degrades because the model never learned to compensate for the quantization noise. The quality loss is real and measurable: traditional PTQ loses 10-15% quality compared to the full-precision baseline.
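To make the rounding step concrete, here is a minimal sketch of post-training quantization on a single weight tensor. It assumes symmetric round-to-nearest with one scale for the whole tensor; the function names and the random test weights are illustrative, not taken from the Gemma release, and production formats typically use per-block scales instead.

```python
import numpy as np

def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to signed 4-bit integers."""
    # Assumption: one scale per tensor. Real formats usually store per-block scales.
    max_abs = np.max(np.abs(weights))
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Values are stored in int8 here for convenience; only the range [-8, 7] is used.
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    """Map the 4-bit integers back to floats for use in a matmul."""
    return q.astype(np.float32) * scale

# Illustrative stand-in for a trained weight tensor stored in 16-bit floats.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float16)
w32 = w.astype(np.float32)

q, scale = quantize_int4(w32)
w_hat = dequantize_int4(q, scale)

# The gap between w32 and w_hat is the quantization noise the model never trained against.
rel_error = float(np.linalg.norm(w32 - w_hat) / np.linalg.norm(w32))
print(f"relative weight error after 4-bit rounding: {rel_error:.3f}")
```

Each weight moves by at most half a quantization step, but those small shifts accumulate across every layer, which is where the downstream quality loss comes from.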

QAT (Quantization-Aware Training) takes a fundamentally different approach. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. Concretely, during training, the model simulates the quantization step in its forward pass, so it learns weights that are robust to the rounding error introduced by lower precision. The model is not surprised by compression at deployment time because it was trained to expect it.
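The idea of simulating quantization in the forward pass can be sketched on a toy linear model. This is an illustration under stated assumptions, not Google's training recipe: it uses a per-tensor 4-bit "fake quantize" step and a straight-through estimator (the gradient computed for the quantized weights is applied directly to the full-precision weights), which is a common way to implement QAT. The function names, data, and hyperparameters are hypothetical.

```python
import numpy as np

def fake_quantize(w, num_bits=4):
    """Quantize then immediately dequantize: w stays float but only takes 4-bit levels."""
    qmax = 2 ** (num_bits - 1) - 1              # 7 for 4-bit
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def train_qat(x, y, steps=300, lr=0.1):
    """Gradient descent on a linear model whose forward pass uses quantized weights."""
    w = np.zeros(x.shape[1])                    # full-precision latent weights
    for _ in range(steps):
        w_q = fake_quantize(w)                  # forward pass sees the rounded weights
        err = x @ w_q - y
        grad = x.T @ err / len(y)               # gradient of 0.5 * MSE with respect to w_q
        w -= lr * grad                          # straight-through: treat d(w_q)/d(w) as identity
    return w

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = x @ w_true

w_latent = train_qat(x, y)
w_deploy = fake_quantize(w_latent)              # what would actually be shipped
loss_qat = float(np.mean((x @ w_deploy - y) ** 2))
print(f"MSE with 4-bit weights after QAT-style training: {loss_qat:.4f}")
```

The loss the optimizer minimizes is always computed with rounded weights, so the weights it settles on are chosen with the rounding error already accounted for. A toy problem like this shows the mechanism only; it says nothing about the size of the accuracy gain on a real model.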

What's actually in the release

The Gemma 4 models optimized with QAT are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B. Two distinct formats are being shipped:

Q4_0 GGUF checkpoints for all five sizes, ready for use with llama.cpp, Ollama, and LM Studio. These are the go-to format for consumer GPU and laptop deployments.
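A rough back-of-the-envelope estimate helps show why a 4-bit checkpoint matters for a 16GB machine. The sketch below assumes the Q4_0 block layout used by llama.cpp (32 weights per block, stored as 32 four-bit values plus one 16-bit scale, or 18 bytes per block), assumes every weight is stored in Q4_0, and takes "26B" to mean exactly 26 billion parameters. Real GGUF files may keep some tensors at higher precision, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
Q4_0_BLOCK_WEIGHTS = 32
# One fp16 scale (2 bytes) plus 32 four-bit values (16 bytes) = 18 bytes per block.
Q4_0_BLOCK_BYTES = 2 + Q4_0_BLOCK_WEIGHTS // 2

def q4_0_weight_bytes(num_params):
    """Bytes needed if every parameter is stored in Q4_0 blocks."""
    blocks = -(-num_params // Q4_0_BLOCK_WEIGHTS)   # ceiling division
    return blocks * Q4_0_BLOCK_BYTES

def bf16_weight_bytes(num_params):
    """Bytes needed for the same parameters at 16 bits each."""
    return num_params * 2

params = 26_000_000_000  # assumption: "26B" taken as exactly 26e9 parameters

bits_per_weight = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS
print(f"Q4_0 bits per weight: {bits_per_weight}")                       # 4.5
print(f"16-bit weights: {bf16_weight_bytes(params) / 1e9:.1f} GB")
print(f"Q4_0 weights:   {q4_0_weight_bytes(params) / 1e9:.1f} GB")
```

The estimate covers weights only, so actual memory use at inference time will be higher and depends on context length and the runtime.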
