vLLM v0.23.0 Ships DeepSeek-V4 Production Hardening and 56% Throughput Boost

vLLM

5H AGO

3 min read

INFRA

inference_optimization model_serving

GPUS

kernels

vLLM v0.23.0 is out, and it reads less like a patch release and more like a platform consolidation. The release packs 408 commits from 200 contributors, 63 of whom are first-timers. The headline themes are clear: DeepSeek-V4 gets a serious production hardening pass, the new Model Runner V2 execution core expands to cover Llama and Mistral, KV cache offloading grows a third storage tier, and the experimental Rust frontend keeps adding features at a steady clip.

DeepSeek-V4 Gets a Real Production Pass

DeepSeek-V4 was introduced in v0.22.0, but this release is where it starts to feel production-ready. The model received another large hardening and optimization pass: its sparse MLA metadata is now decoupled from DeepSeek-V3.2, and it gained a TRTLLM-gen attention kernel, EPLB support for the Mega-MoE, and selective prefix-cache retention for the sliding-window KV cache.

The MLA (Multi-head Latent Attention) decoupling from V3.2 matters because it means the two models no longer share internal state that can cause subtle correctness bugs when running them side by side. The TRTLLM-gen attention kernel is a TensorRT-LLM-derived kernel that replaces a more generic path with one tuned specifically for DeepSeek's attention pattern. The model was also detached from torch.compile, and its attention and RoPE paths were refactored. Detaching from torch.compile trades some potential graph-optimization upside for more predictable startup times and easier debugging in production.

EPLB (Expert Parallel Load Balancer) is worth understanding if you run large MoE models at scale. MoE models are typically trained so that each expert receives a similar number of tokens, but in practice the distribution of tokens across experts can be highly skewed, which leaves some expert-parallel (EP) ranks overloaded while others sit idle. vLLM's EPLB redistributes expert mappings across EP ranks to even out the load. To do this, each MoE forward pass records per-token load, and a sliding window aggregates these statistics across EP ranks. As of v0.23.0, this mechanism also covers DeepSeek-V4's larger "Mega-MoE" configuration, which has more experts and a wider routing fan-out than V3.
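To make the idea concrete, here is a minimal sketch of the two pieces described above: a sliding window of routing statistics and a rebalancing step. It is not vLLM's implementation. The names `ExpertLoadWindow` and `rebalance` are hypothetical, the placement uses a simple greedy heuristic chosen for illustration, and it assumes the expert count divides evenly across ranks.

```python
from collections import deque


class ExpertLoadWindow:
    """Sliding window of per-expert token counts (hypothetical helper)."""

    def __init__(self, num_experts: int, window_size: int) -> None:
        self.num_experts = num_experts
        # deque(maxlen=...) silently drops the oldest step once the window is full.
        self.steps: deque[list[int]] = deque(maxlen=window_size)

    def record(self, expert_ids: list[int]) -> None:
        """Record one forward pass: expert_ids[i] is the expert chosen for token i."""
        counts = [0] * self.num_experts
        for expert in expert_ids:
            counts[expert] += 1
        self.steps.append(counts)

    def totals(self) -> list[int]:
        """Sum token counts per expert over the steps still in the window."""
        totals = [0] * self.num_experts
        for counts in self.steps:
            for expert, count in enumerate(counts):
                totals[expert] += count
        return totals


def rebalance(loads: list[int], num_ranks: int) -> list[list[int]]:
    """Greedy placement: heaviest expert first, onto the lightest rank with room.

    Assumption: every rank hosts the same number of experts.
    """
    if len(loads) % num_ranks != 0:
        raise ValueError("number of experts must be divisible by number of ranks")
    capacity = len(loads) // num_ranks
    placement: list[list[int]] = [[] for _ in range(num_ranks)]
    rank_load = [0] * num_ranks
    for expert in sorted(range(len(loads)), key=lambda e: -loads[e]):
        open_ranks = [r for r in range(num_ranks) if len(placement[r]) < capacity]
        target = min(open_ranks, key=lambda r: rank_load[r])
        placement[target].append(expert)
        rank_load[target] += loads[expert]
    return placement


if __name__ == "__main__":
    window = ExpertLoadWindow(num_experts=4, window_size=2)
    window.record([0, 0, 0, 1])  # falls out of the window after two more steps
    window.record([0, 0, 2, 3])
    window.record([0, 1, 1, 2])
    loads = window.totals()
    print(loads, rebalance(loads, num_ranks=2))
```

With the skewed loads in this toy run, a naive contiguous layout (experts 0 and 1 on one rank, 2 and 3 on the other) would split the work 5 to 3, while the greedy placement splits it 4 to 4. A production balancer has more levers than this sketch models, such as replicating hot experts and accounting for the cost of moving weights.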

Model Runner V2 Becomes the Default for Dense Models

Model Runner V2 (MRv2) is vLLM's ground-up rewrite of its core execution engine, and it's been rolling out model-by-model since it launched. The team revisited persistent batching, async scheduling, input preparation, and sampling, then rebuilt the model runner around three core principles: be modular (isolate model-specific logic from the common execution path), be GPU-native (move bookkeeping off the CPU and onto the GPU), and be async-first (treat overlapped CPU/GPU execution as a design constraint, not a retrofit).
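The async-first principle is easiest to see in a toy loop. The sketch below is an illustration only, not MRv2 code: `prepare_inputs` and `run_model` are hypothetical stand-ins for CPU-side input preparation and the GPU forward pass, and a single worker thread plays the role of the CPU work that gets overlapped with execution.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_inputs(batch: list[str]) -> list[int]:
    """Stand-in for CPU-side input preparation (hypothetical)."""
    return [len(prompt) for prompt in batch]


def run_model(inputs: list[int]) -> int:
    """Stand-in for the GPU forward pass (hypothetical)."""
    return sum(inputs)


def run_overlapped(batches: list[list[str]]) -> list[int]:
    """Prepare step N+1 on a worker thread while step N executes."""
    results: list[int] = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as cpu_worker:
        pending = cpu_worker.submit(prepare_inputs, batches[0])
        for next_batch in batches[1:] + [None]:
            inputs = pending.result()  # wait for this step's inputs
            if next_batch is not None:
                # Kick off preparation for the next step before running this one.
                pending = cpu_worker.submit(prepare_inputs, next_batch)
            results.append(run_model(inputs))
    return results


if __name__ == "__main__":
    print(run_overlapped([["a", "bb"], ["ccc"], []]))
```

The point of the structure is that preparation for the next step is submitted before the current step runs, so the loop never waits on CPU work that could have been done in the background. In a real engine the forward pass is a GPU launch rather than a Python function, which is where the overlap actually pays off.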
