Google Shrinks Gemma 4 to 1GB so It Runs on Your Phone

LM Studio

Jun 05, 2026

2 min read

Google just dropped a set of quantization-aware training (QAT) checkpoints for every member of the Gemma 4 family, and they're already live in LM Studio. The release covers the entire lineup, from the tiny E2B up to the 31B dense model, and is specifically engineered to let these models squeeze onto consumer GPUs, laptops, and even phones without the usual quality cliff that comes from aggressive compression.

What QAT actually changes

The standard way to shrink a model after training is post-training quantization (PTQ): you take fully trained weights and round them to a lower precision, such as 4-bit integers. It works, but it often degrades performance, because the network never saw those rounding errors during training. QAT takes a different route. By simulating quantization during training, it minimizes quality loss when the model is compressed, so the network learns to be robust to rounding errors instead of being blindsided by them at the end.
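To make the rounding step concrete, here is a minimal sketch of "fake quantization" in plain Python: weights are snapped to a symmetric signed integer grid and mapped back to floats. The function names, the per-tensor symmetric scheme, and the Gaussian toy weights are assumptions for illustration, not Google's actual recipe. PTQ applies this kind of rounding once, after training; QAT setups typically run something like it inside the forward pass during training so the loss already reflects the rounding error.

```python
import random

def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid, then map back to floats.

    Hypothetical helper for illustration; real toolchains usually quantize
    per-channel or per-block rather than per-tensor.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = []
    for w in weights:
        q = round(w / scale)                    # nearest integer level
        q = max(-qmax - 1, min(qmax, q))        # clamp to the representable range
        quantized.append(q * scale)             # dequantize back to float
    return quantized

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Toy "weights": small Gaussian values, an assumption for the demo.
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]

err4 = mean_abs_error(weights, fake_quantize(weights, bits=4))
err8 = mean_abs_error(weights, fake_quantize(weights, bits=8))
print(f"4-bit mean abs error: {err4:.6f}")
print(f"8-bit mean abs error: {err8:.6f}")
```

The 4-bit grid has far fewer levels than the 8-bit one, so its rounding error is noticeably larger. That gap is the error a PTQ'd model has to absorb without ever having trained against it.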

The practical payoff is significant. According to the model card, the QAT versions of the Gemma 4 family preserve quality similar to bfloat16 while dramatically reducing the memory required to load the model. Unsloth's port advertises roughly 3x less memory use and near-original accuracy versus the full-precision checkpoints.
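For a rough sense of scale, the back-of-the-envelope calculation below estimates weight storage alone for the 31B dense model at 16 bits versus 4 bits per weight. The helper name and the flat 4-bit assumption are illustrative; the estimate ignores the KV cache, activations, quantization scales, and any layers kept at higher precision, so real footprints will differ.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_31b = 31e9  # parameter count taken from the "31B" model name

bf16_gb = weight_memory_gb(params_31b, 16)  # bfloat16: 2 bytes per weight
int4_gb = weight_memory_gb(params_31b, 4)   # assumed flat 4 bits per weight

print(f"bfloat16 weights: {bf16_gb:.1f} GB")       # 62.0 GB
print(f"4-bit weights:    {int4_gb:.1f} GB")       # 15.5 GB
print(f"reduction:        {bf16_gb / int4_gb:.1f}x")  # 4.0x
```

The weights-only ratio comes out to 4x. Overheads such as scale factors and tensors kept at higher precision could be one reason an advertised end-to-end figure lands closer to 3x.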
