

Google just dropped a set of quantization-aware training (QAT) checkpoints for every member of the Gemma 4 family, and they're already live in LM Studio. The release covers the entire lineup, from the tiny E2B up to the 31B dense model, and is specifically engineered to let these models squeeze onto consumer GPUs, laptops, and even phones without the usual quality cliff that comes from aggressive compression.
What QAT actually changes
The standard way to shrink a model after training is post-training quantization (PTQ): you take fully trained weights and round them to lower precision, such as 4-bit integers. It works, but it often degrades performance. QAT takes a different route: by simulating quantization during training, it minimizes quality loss when the model is compressed, so the network learns to be robust to rounding errors instead of being blindsided by them at the end.
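To make the rounding step concrete, here is a minimal sketch of symmetric 4-bit "fake quantization" in plain Python. It is an illustration only: the function name `fake_quantize`, the per-tensor scale, and the sample weights are assumptions made for this example, not the recipe Google used for the Gemma 4 checkpoints.

```python
def fake_quantize(weights, bits=4):
    """Snap each weight to a symmetric signed integer grid, then map it back to a float.

    Hypothetical helper for illustration; real toolchains typically use
    per-channel or per-block scales rather than one scale for the whole list.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float step between grid points
    result = []
    for w in weights:
        q = round(w / scale)                        # nearest integer level
        q = max(-qmax - 1, min(qmax, q))            # clamp to the representable range
        result.append(q * scale)                    # back to float: the value the model "sees"
    return result


weights = [0.7, -0.35, 0.12, 0.05]                  # made-up sample values
approx = fake_quantize(weights)
errors = [abs(w - a) for w, a in zip(weights, approx)]

for w, a, e in zip(weights, approx, errors):
    print(f"original={w:+.3f}  quantized={a:+.3f}  error={e:.3f}")
```

PTQ applies a step like this once, after training has finished. QAT, as described above, simulates the same rounding during training, so the weights can adapt to the error it introduces.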
The practical payoff is significant. According to the model card, the QAT versions of the Gemma 4 family preserve quality similar to bfloat16 while dramatically reducing the memory required to load the model. Unsloth's port advertises roughly 3x less memory use and near-original accuracy compared with the full-precision checkpoints.
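A back-of-the-envelope calculation shows where savings of this order come from. The sketch below counts weight storage only and assumes a 4-bit format for the quantized checkpoint; the exact bit width and layout of the released files are assumptions here, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


params = 31e9                                # the 31B dense model
bf16_gb = weight_memory_gb(params, 16)       # bfloat16 stores 16 bits per weight
int4_gb = weight_memory_gb(params, 4)        # assumed 4-bit quantized weights

print(f"bfloat16: {bf16_gb:.1f} GB")
print(f"4-bit:    {int4_gb:.1f} GB")
print(f"ideal ratio: {bf16_gb / int4_gb:.1f}x")
```

This is an upper bound on the savings for the weights alone. Real footprints also include quantization scales, any tensors kept at higher precision, and runtime buffers, so a measured ratio below this ideal figure would not be surprising.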
