Novita AI's PegaFlow Stops vLLM From Losing Its Entire Cache on Restart

vLLM

May 20, 2026

2 min read

INFRA

inference_optimization model_serving

TRAINING_INFRA

distributed_training

Every time vLLM restarts, hundreds of gigabytes of carefully warmed KV cache vanish. For teams running production inference fleets, that means slower cold starts, wasted GPU cycles re-computing prefills, and cache hit rates that reset to zero after every upgrade or crash. PegaFlow, built by Novita AI in collaboration with the vLLM team, is a direct fix: a standalone external KV cache service that outlives the inference engine itself.

PegaFlow integrates with vLLM as an external KV cache service, implemented as a standalone Rust process and connected through the external KV connector interface. It moves KV cache lifetime out of the vLLM worker process, pools cache across local instances and remote nodes, and combines pinned host memory, RDMA-accessible remote memory, and SSD into a three-level cache hierarchy.
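
To make the three-level hierarchy concrete, here is a minimal Python sketch of a tiered lookup. It is an illustration only, not PegaFlow's implementation (which is a Rust process). The class name, the tier names, the use of plain dictionaries as stand-ins for pinned host memory, remote memory, and SSD, and the promote-on-hit policy are all assumptions made for the example.

```python
from typing import Optional


class TieredKVCache:
    """Toy three-level lookup: host memory, then remote memory, then SSD.

    Hypothetical sketch: plain dicts stand in for the real storage tiers.
    """

    def __init__(self) -> None:
        # Ordered fastest to slowest. Tier names are illustrative only.
        self.tiers: list[tuple[str, dict[str, bytes]]] = [
            ("host", {}),
            ("remote", {}),
            ("ssd", {}),
        ]

    def put(self, block_hash: str, data: bytes, tier: str = "host") -> None:
        """Store a KV block in the named tier."""
        for name, store in self.tiers:
            if name == tier:
                store[block_hash] = data
                return
        raise ValueError(f"unknown tier: {tier}")

    def get(self, block_hash: str) -> Optional[tuple[str, bytes]]:
        """Return (tier_name, data) for the fastest tier holding the block."""
        for name, store in self.tiers:
            if block_hash in store:
                data = store[block_hash]
                if name != "host":
                    # Assumed policy: copy slower-tier hits into host memory.
                    self.tiers[0][1][block_hash] = data
                return name, data
        return None


cache = TieredKVCache()
cache.put("abc", b"kv-bytes", tier="ssd")
first = cache.get("abc")    # served from the SSD stand-in, then promoted
second = cache.get("abc")   # now served from the host stand-in
```

The point of the ordering is that a miss in the fastest tier does not mean a recompute: the block may still be recoverable from a slower tier at lower cost than re-running the prefill.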

The problem nobody talks about enough

KV cache (the stored attention keys and values for previously processed tokens) is one of the most expensive assets in LLM serving. It takes time to allocate and warm, and it can occupy hundreds of gigabytes per host. In a conventional vLLM deployment, that cache lives inside the inference process itself.

That coupling becomes painful during engine crashes, rolling upgrades, and model switches: when an engine restarts, the host KV pool disappears with it. For a production fleet doing rolling deploys, this is a recurring tax on latency and throughput.

There is a second, less obvious problem: isolation. When multiple small-model instances run on one host, each keeps its own private cache, so eight Qwen3-8B instances on an 8-GPU host can store the same system prompt eight times. For models such as DeepSeek-V3.2, the logical latent KV can be stored once, but an in-process deployment with 8-way tensor parallelism (TP8) may physically store it once per rank. The cache budget is the same, but most of it is wasted on duplicates.
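
The duplication is easy to count. The sketch below compares isolated per-instance caches with a single shared pool, assuming blocks are keyed by a hash of their token prefix. The article does not describe how PegaFlow keys its blocks, so the keying scheme and the function names here are hypothetical.

```python
import hashlib


def prefix_key(token_ids: list[int]) -> str:
    # Content-derived key: identical token prefixes map to the same key.
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()


def copies_isolated(prompts_per_instance: list[list[list[int]]]) -> int:
    # Each instance deduplicates only within its own private cache.
    return sum(
        len({prefix_key(p) for p in prompts})
        for prompts in prompts_per_instance
    )


def copies_shared(prompts_per_instance: list[list[list[int]]]) -> int:
    # One pool for the whole host deduplicates across all instances.
    return len({
        prefix_key(p)
        for prompts in prompts_per_instance
        for p in prompts
    })


# Eight instances, each caching the same 1,000-token system prompt.
system_prompt = list(range(1000))
instances = [[system_prompt] for _ in range(8)]

isolated = copies_isolated(instances)
shared = copies_shared(instances)
```

With eight instances and one common prompt, the isolated layout holds eight copies and the shared pool holds one, which is the gap a host-level pool is meant to close.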

A daemon that owns the cache

PegaFlow addresses this by moving the KV cache runtime into a standalone daemon on each machine. The PegaFlow server owns the host KV pool, SSD cache, topology metadata, RDMA resources, indexing state, and background tasks. vLLM workers connect to the local PegaFlow process through CUDA IPC on the data path and gRPC on the local control path.
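
The lifetime argument can be shown with a toy model. In this sketch, object construction stands in for a process start and a shared dictionary stands in for the daemon's pool. None of the class names come from PegaFlow or vLLM, and the real system connects separate processes over CUDA IPC and gRPC rather than sharing Python objects.

```python
class InProcessEngine:
    """Engine that owns its cache, so a restart discards it."""

    def __init__(self) -> None:
        self.cache: dict[str, bytes] = {}


class CacheDaemon:
    """Stand-in for a long-lived external cache process (hypothetical)."""

    def __init__(self) -> None:
        self.pool: dict[str, bytes] = {}


class ExternalCacheEngine:
    """Engine that holds only a handle to the daemon's pool."""

    def __init__(self, daemon: CacheDaemon) -> None:
        self.cache = daemon.pool


# In-process design: a restart creates a fresh, empty cache.
engine = InProcessEngine()
engine.cache["block-0"] = b"kv"
engine = InProcessEngine()  # simulated restart
in_process_survived = "block-0" in engine.cache

# External design: the daemon outlives the worker, so the block remains.
daemon = CacheDaemon()
worker = ExternalCacheEngine(daemon)
worker.cache["block-0"] = b"kv"
worker = ExternalCacheEngine(daemon)  # simulated restart
external_survived = "block-0" in worker.cache
```

Only the second design keeps the block across the simulated restart, because the cache's owner was never restarted.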
