

Running a capable open-weight model locally has always been a hardware negotiation. You either pay for cloud inference, settle for a weaker model, or buy more GPU memory. Google DeepMind just shifted that equation significantly: Gemma 4 QAT checkpoints are now live on Hugging Face for all five model sizes, cutting memory requirements by roughly 72% while keeping quality close to the full-precision originals.
The problem with squishing models after the fact
Standard quantization, known as Post-Training Quantization (PTQ), takes a fully trained model and rounds its weights to the nearest lower-precision values. It is fast and cheap, but every rounding step introduces error, and those errors compound across dozens of transformer layers. The result is a smaller model that is also noticeably less capable.
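To make the rounding error concrete, here is a minimal sketch of round-to-nearest quantization on one block of 32 weights. It is a simplified symmetric 4-bit scheme written for illustration, not the exact Q4_0 layout used by llama.cpp, and the weight values are random stand-ins rather than anything taken from Gemma.

```python
import random

def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one shared scale per block
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

random.seed(0)
# Stand-in weights: small Gaussian values, purely illustrative.
weights = [random.gauss(0.0, 0.02) for _ in range(32)]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale={scale:.6f}  max abs error={max_err:.6f}")
```

Each weight can land up to half a grid step away from its original value. PTQ accepts that error as is, for every block in every layer.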
Quantization-Aware Training (QAT) takes a different approach. By simulating quantization during training, QAT minimizes quality loss when the model is compressed. The model learns to work within the constraints of low precision from the start, rather than having compression imposed on it afterward. QAT consistently beats standard PTQ at the same compression level, landing within a few points of the BF16 originals.
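The idea of simulating quantization during training can be sketched on a toy problem. The example below is an assumption-laden illustration, not Google's training pipeline: it fits a three-weight linear model, snaps the weights to a 4-bit grid in the forward pass ("fake quantization"), and updates the underlying full-precision weights as if the rounding were the identity function (the straight-through estimator, a common QAT trick). The target values, grid step, and learning rate are all made up for the demo.

```python
import random

def fake_quant(w, scale, qmax=7):
    """Quantize then dequantize: the result is a float that sits on the 4-bit grid."""
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale

def loss(weights, data, scale):
    """Mean squared error of the model evaluated with quantized weights."""
    wq = [fake_quant(w, scale) for w in weights]
    total = 0.0
    for xs, y in data:
        pred = sum(a * x for a, x in zip(wq, xs))
        total += (pred - y) ** 2
    return total / len(data)

random.seed(0)
target = [0.31, -0.18, 0.07]          # hypothetical "true" weights
data = []
for _ in range(64):
    xs = [random.uniform(-1, 1) for _ in target]
    data.append((xs, sum(t * x for t, x in zip(target, xs))))

scale = 0.05                          # fixed grid step (assumption)
weights = [0.0, 0.0, 0.0]             # latent full-precision weights
initial_loss = loss(weights, data, scale)

lr = 0.05
for step in range(200):
    wq = [fake_quant(w, scale) for w in weights]   # forward pass sees quantized weights
    grads = [0.0] * len(weights)
    for xs, y in data:
        err = sum(a * x for a, x in zip(wq, xs)) - y
        for i, x in enumerate(xs):
            grads[i] += 2 * err * x / len(data)
    # Straight-through estimator: treat d(fake_quant)/dw as 1 and
    # apply the gradient to the latent float weights.
    weights = [w - lr * g for w, g in zip(weights, grads)]

final_loss = loss(weights, data, scale)
print(f"loss before: {initial_loss:.5f}  after: {final_loss:.5f}")
print("quantized weights:", [fake_quant(w, scale) for w in weights])
```

The loss is always measured with the quantized weights, so training only rewards solutions that still work after rounding. That is the sense in which the model learns within the low-precision constraint rather than having it imposed afterward.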
What actually ships
The Gemma 4 models optimized with QAT are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B. Four checkpoint formats are provided to match different deployment targets:
- Unquantized QAT (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research.
- GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Works directly with llama.cpp, Ollama, and LM Studio.
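A back-of-the-envelope check shows where a figure close to 72% can come from. The sketch below assumes the llama.cpp Q4_0 layout (blocks of 32 weights, each stored as one 16-bit scale plus 32 four-bit codes) and compares it with 16 bits per weight for BF16. The article does not say how its 72% was derived, so treat this as one plausible reading. The 31-billion parameter count simply takes the "31B" name at face value, and the estimate covers weights only, ignoring the KV cache, activations, and any tensors kept at higher precision.

```python
Q4_0_BLOCK_WEIGHTS = 32        # weights per Q4_0 block
Q4_0_BLOCK_BYTES = 2 + 16      # one fp16 scale + 32 four-bit codes
BF16_BITS = 16

def bits_per_weight_q4_0():
    """Average storage cost per weight in Q4_0, including the block scale."""
    return Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS

def weight_gb(n_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes) for a given parameter count."""
    return n_params * bits_per_weight / 8 / 1e9

bpw = bits_per_weight_q4_0()
saving = 1 - bpw / BF16_BITS

n_params = 31e9                # assumption: "31B" read as 31 billion weights
bf16_gb = weight_gb(n_params, BF16_BITS)
q4_gb = weight_gb(n_params, bpw)

print(f"Q4_0 bits per weight: {bpw}")
print(f"saving vs BF16: {saving:.1%}")
print(f"31B weights: {bf16_gb:.1f} GB (BF16) vs {q4_gb:.1f} GB (Q4_0)")
```

Under these assumptions Q4_0 costs 4.5 bits per weight, which is about 28% of BF16's 16 bits, in line with the roughly 72% reduction quoted above.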
