NVIDIA Shrinks GLM-5.2 Memory by 1.8x With NVFP4 Without Losing Accuracy

vLLM

3H AGO

2 min read

LLMS

INFRA

inference_optimization model_serving

GLM-5.2-NVFP4 is now ready to serve in vLLM. NVIDIA just dropped the official NVFP4 checkpoint of Z.ai's GLM-5.2, the 744B-parameter MoE model built for long-horizon coding and agentic tasks, and it's already deployable with a single vllm serve command. The headline promise: smaller memory footprint than FP8, same accuracy.

The model underneath

GLM-5.2 is an open-weights model from Z.ai (formerly Zhipu AI), tuned heavily for software engineering, multi-step reasoning, and tool-augmented agent work. It builds on the Mixture-of-Experts (MoE) foundation introduced with GLM-5 and GLM-5.1, extending the context window to a usable 1 million tokens while preserving strong coding performance.

The model has approximately 753B total parameters, with roughly 40B active per token. That last number is what matters for compute cost: each token's forward pass runs through only about 40B parameters, not all 753B. Every one of those 753B still has to sit in GPU memory, though, which is where a smaller checkpoint helps. The headline change over GLM-5 / 5.1 is that Multi-Token Prediction (MTP) is extended from 3 to 5 draft tokens, lifting end-to-end throughput on reasoning, coding, and agentic workloads. MTP is a speculative decoding technique: the model drafts several future tokens at once and then verifies them, so a decoding step can emit more than one token when the drafts are accepted.
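To make the two numbers above concrete, here is a minimal back-of-the-envelope sketch. It computes the active-parameter fraction from the 753B and 40B figures, and the expected tokens per decoding step for 3 versus 5 draft tokens. The acceptance model is an assumption, not something the source reports: each draft token is accepted independently with a fixed probability, the first rejection discards the remaining drafts, and the verification pass always yields one token. The 0.8 acceptance rate is purely illustrative.

```python
def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of the model's parameters used for a single token."""
    return active_params_b / total_params_b


def expected_tokens_per_step(num_draft_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per verification step.

    Assumes (hypothetically) that each draft token is accepted independently
    with probability `acceptance_rate` and that the first rejection discards
    all later drafts. The verify pass itself always contributes one token,
    so the expectation is 1 + p + p^2 + ... + p^k.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    # 753B total and 40B active are the article's approximate figures.
    print(f"active fraction: {moe_active_fraction(753, 40):.3f}")

    # 0.8 is an illustrative acceptance rate, not a measured one.
    for k in (3, 5):
        print(f"{k} draft tokens: {expected_tokens_per_step(k, 0.8):.3f} tokens/step")
```

Under this simplified model the gain from extra draft tokens shrinks as the acceptance rate drops, since later drafts are only useful if every earlier one was accepted.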

GLM-5.2 also introduces IndexShare, which reuses a single attention indexer across each group of four sparse attention layers instead of running one per layer, reducing per-token FLOPs by 2.9x at a 1M context length. This is what makes a 1M-token context window practical rather than just a marketing number.
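The sharing pattern can be pictured with a toy schedule. This is only a sketch under assumptions the source does not confirm: that the first layer of each group of four runs the indexer and the other three reuse its result, and that 8 sparse layers is a reasonable number to illustrate with (the real layer count is not given). It counts indexer invocations only and does not attempt to reproduce the 2.9x FLOPs figure.

```python
def indexer_schedule(num_sparse_layers: int, group_size: int = 4):
    """Return (layer, source_layer, runs_indexer) for each sparse layer.

    Hypothetical layout: the first layer of every group computes the
    token-selection indices; later layers in the group reuse them.
    """
    schedule = []
    for layer in range(num_sparse_layers):
        source = (layer // group_size) * group_size
        schedule.append((layer, source, layer == source))
    return schedule


def indexer_runs(num_sparse_layers: int, group_size: int = 4) -> int:
    """Number of layers that actually execute the indexer."""
    return sum(1 for _, _, runs in indexer_schedule(num_sparse_layers, group_size) if runs)


if __name__ == "__main__":
    # 8 layers is an illustrative count, not the model's real depth.
    for layer, source, runs in indexer_schedule(8):
        action = "runs indexer" if runs else f"reuses indices from layer {source}"
        print(f"layer {layer}: {action}")
    print("indexer runs:", indexer_runs(8), "of 8 layers")
```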

What NVFP4 actually is

NVFP4 is not your typical INT4 quantization. Introduced with NVIDIA Blackwell, it is a 4-bit floating point format designed to preserve model accuracy at ultra-low precision using a two-level scaling strategy: an FP8 (E4M3) scale for each small block of values, plus an FP32 scale for the tensor as a whole. Each block holds 16 values, compared with 32 in the related MXFP4 format, so every scale adapts to a more local slice of the data's dynamic range and quantization error drops.
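The block-scaling idea can be sketched in a few lines. This is a simplified illustration, not NVIDIA's implementation: it quantizes one block of 16 values to the FP4 (E2M1) magnitude grid, but keeps the block scale as an ordinary Python float instead of rounding it to FP8 E4M3, and it omits the per-tensor FP32 scale entirely. The helper `bits_per_weight` then estimates storage cost from the block size and scale width, ignoring the per-tensor scale and any layers left unquantized.

```python
# Magnitudes representable by a 4-bit E2M1 float (a sign bit is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 block size; MXFP4 uses 32


def quantize_block(block):
    """Quantize one block to signed E2M1 values with a shared scale.

    Simplification: the scale stays a full-precision float here, whereas
    NVFP4 stores it in FP8 E4M3 and adds a per-tensor FP32 scale.
    """
    amax = max(abs(x) for x in block)
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_VALUES, key=lambda v: abs(target - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the scale and E2M1 codes."""
    return [c * scale for c in codes]


def bits_per_weight(element_bits, scale_bits, block_size):
    """Average storage per weight, counting the shared block scale."""
    return element_bits + scale_bits / block_size


if __name__ == "__main__":
    block = [0.1 * i for i in range(-8, 8)]  # 16 illustrative values
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f}, worst abs error={worst:.4f}")

    nvfp4 = bits_per_weight(4, 8, BLOCK_SIZE)  # 4-bit values + 8-bit scale per 16
    print(f"NVFP4 bits/weight: {nvfp4}, ratio vs 8-bit: {8 / nvfp4:.2f}")
```

With a 4-bit value and an 8-bit scale shared by 16 values, the average comes to 4.5 bits per weight, and 8 / 4.5 is about 1.78. That is one plausible way to arrive at a figure close to the headline's 1.8x relative to FP8, though the source does not spell out how its number was derived.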
