Google DeepMind's Gemma 4 QAT Slashes Memory 72% for Local Deployment

Jun 05, 2026

2 min read

GPUS

edge_deployment quantization

LLMS

small_models

Running a capable open-weight model locally has always been a hardware negotiation. You either pay for cloud inference, settle for a weaker model, or buy more GPU memory. Google DeepMind has now shifted that trade-off: Gemma 4 QAT checkpoints are live on Hugging Face for all five model sizes, cutting memory requirements by roughly 72% while keeping quality close to the full-precision originals.

The problem with squishing models after the fact

Standard quantization, known as Post-Training Quantization (PTQ), takes a fully trained model and rounds its weights to lower-precision values. It is fast and cheap, but every rounding step introduces error, and those errors compound across dozens of transformer layers. The result is a smaller model that is also noticeably less capable.
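
To make the rounding error concrete, here is a minimal sketch of round-to-nearest 4-bit quantization applied to one block of weights. It is a simplified symmetric scheme chosen for illustration, not the exact arithmetic used by any particular runtime or by Google's pipeline; the function names and the random test weights are assumptions.

```python
import random

def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of weights."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight becomes a small integer code; the clamp guards the range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(32)]  # illustrative values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

# The per-weight error is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.6f}  max abs rounding error={max_err:.6f}")
```

Each weight can be off by up to half a quantization step, and a post-training pass has no opportunity to adjust the other weights to compensate.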

Quantization-Aware Training (QAT) takes a different approach: it simulates quantization during training, so the model learns to work within the constraints of low precision from the start rather than having compression imposed on it afterward. This minimizes the quality lost when the model is compressed. QAT consistently beats standard PTQ at the same compression level, landing within a few points of the BF16 originals.
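
The idea of simulating quantization during training can be sketched with a single-weight toy model. The forward pass uses the rounded weight, while the optimizer updates a full-precision latent copy, passing gradients through the rounding step unchanged (a straight-through estimator, a common choice for this). This is only an illustration of the mechanism under those assumptions; the article does not describe Google's actual training recipe, and every name and hyperparameter below is hypothetical.

```python
def fake_quant(w, scale, qmax=7):
    """Round w to the nearest representable 4-bit value, returned as a float."""
    code = max(-qmax - 1, min(qmax, round(w / scale)))
    return code * scale

def train_qat(data, scale=0.25, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees a quantized w."""
    w = 0.0  # latent full-precision weight kept by the optimizer
    for _ in range(steps):
        wq = fake_quant(w, scale)  # forward pass uses the quantized weight
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2 * (wq * x - y) * x for x, y in data) / len(data)
        # Straight-through estimator: treat d(wq)/d(w) as 1 and update w.
        w -= lr * grad
    return w, fake_quant(w, scale)

# Toy data generated by a target weight of 1.1, which is not on the 0.25 grid.
data = [(x, 1.1 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]
latent, deployed = train_qat(data)
print(f"latent weight={latent:.4f}  deployed (quantized) weight={deployed}")
```

In this toy the deployed weight ends up on one of the two grid points that bracket the target. In a real network the benefit comes from many weights adjusting jointly to each other's rounding, which a one-weight example cannot show.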

What actually ships

The Gemma 4 models optimized with QAT are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B. Four checkpoint formats are provided to match different deployment targets:

Unquantized QAT (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research.
GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Works directly with llama.cpp, Ollama, and LM Studio.
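
As a rough sanity check on the memory figure, the sketch below estimates weight storage for BF16 versus GGUF Q4_0. It relies on the Q4_0 block layout used by llama.cpp (32 weights stored as 4-bit codes plus one 16-bit scale per block). The parameter count is an assumption read off the "31B" model name, and the calculation ignores the KV cache, activations, and any tensors kept at higher precision, so it is not a statement of how the article's 72% was measured.

```python
def bits_per_weight_q4_0(block_size=32, code_bits=4, scale_bits=16):
    """Average storage per weight for one Q4_0 block, scale included."""
    return (block_size * code_bits + scale_bits) / block_size

def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes) for a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

BF16_BITS = 16
q4_bits = bits_per_weight_q4_0()
saving = 1 - q4_bits / BF16_BITS

n_params = 31e9  # assumption: "31B" taken as roughly 31 billion weights
bf16_gb = weight_memory_gb(n_params, BF16_BITS)
q4_gb = weight_memory_gb(n_params, q4_bits)

print(f"Q4_0 bits per weight: {q4_bits}")
print(f"Estimated reduction vs BF16: {saving:.1%}")
print(f"Weights only: {bf16_gb:.1f} GB (BF16) vs {q4_gb:.1f} GB (Q4_0)")
```

Under these assumptions Q4_0 works out to 4.5 bits per weight, a reduction of just under 72% relative to BF16, which is consistent with the headline number.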
