NVIDIA's Dynamo Snapshot Cuts AI Inference Cold Starts by 21x on Kubernetes

NVIDIA AI

May 27, 2026

2 min read

INFRA

inference_optimization model_serving

GPUS

Every time a Kubernetes cluster needs to spin up a new inference replica to handle a traffic spike, it pays a steep tax. A cold start means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling CUDA graphs, and registering with the service discovery layer. For large models, that bill can run into minutes. NVIDIA's answer is Dynamo Snapshot, a checkpoint/restore system that skips the entire cold-start sequence and brings a fully warm inference worker back to life in seconds.

The GPU idle problem nobody talks about

In production inference deployments, demand fluctuates over time, so inference replicas must scale elastically. Cold-starting an inference workload on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests. The cost is more than wasted compute: the delay also raises the risk of SLA violations during traffic spikes, because the system cannot scale out quickly enough to absorb a sudden increase in demand.

NVIDIA Dynamo Snapshot is a checkpoint/restore system for AI inference workloads on Kubernetes. It serializes the full state of a running inference worker, both GPU-side and CPU-side, and restores it on the same node or a different one, skipping the cold-start sequence entirely. The key insight is that the cold-start cost only needs to be paid once. After that, every scale-out event restores from a frozen snapshot instead of booting from scratch.
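To make the "pay once" argument concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (a 180-second cold start, a 10-second restore, 4 GPUs per replica, 20 scale-out events) is a hypothetical placeholder chosen for illustration, not a figure from NVIDIA, and the class and method names are invented for this example. It also ignores the one-time cost of writing the snapshot itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleOutCost:
    """Illustrative model of GPU time spent idle while replicas start."""

    cold_start_s: float    # hypothetical full cold-start duration, in seconds
    restore_s: float       # hypothetical snapshot-restore duration, in seconds
    gpus_per_replica: int  # GPUs held by each replica while it starts

    def idle_gpu_seconds(self, scale_out_events: int, use_snapshot: bool) -> float:
        """Total GPU-seconds allocated but not serving across all events."""
        if scale_out_events <= 0:
            return 0.0
        if not use_snapshot:
            # Every new replica boots from scratch.
            startup_s = scale_out_events * self.cold_start_s
        else:
            # The first replica pays the cold start once (to produce the
            # snapshot); every later replica pays only the restore time.
            startup_s = self.cold_start_s + (scale_out_events - 1) * self.restore_s
        return startup_s * self.gpus_per_replica


# Hypothetical inputs, for illustration only.
cost = ScaleOutCost(cold_start_s=180.0, restore_s=10.0, gpus_per_replica=4)
without_snapshot = cost.idle_gpu_seconds(20, use_snapshot=False)
with_snapshot = cost.idle_gpu_seconds(20, use_snapshot=True)

print(without_snapshot)  # 14400.0
print(with_snapshot)     # 1480.0
```

Under these assumed inputs the idle GPU time falls by roughly a factor of ten. The actual ratio depends entirely on the real cold-start and restore times of a given model and cluster.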

Two tools, one frozen worker

A running inference worker holds two distinct kinds of state, and both must be captured. Dynamo Snapshot uses one tool for each. cuda-checkpoint serializes GPU device state (CUDA contexts, streams, device memory, virtual address mappings) into the host (CPU) memory of the process that owns each CUDA context, using the checkpointing capability of the CUDA driver. CRIU (Checkpoint/Restore in Userspace) walks the Linux kernel's bookkeeping and serializes the host-side process tree (CPU memory, threads, file descriptors, namespaces) to disk.
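The division of labor between the two tools can be sketched as an ordered list of commands. The Python below only builds and prints the command lines, it does not run them. It is based on how the standalone cuda-checkpoint and criu command-line tools are commonly invoked by hand, and is an assumption about the general pattern rather than a description of how Dynamo Snapshot drives them internally. The function names, the PID, and the image directory are hypothetical, and a real deployment would need further options (container runtime integration, privileges, open sockets).

```python
import shlex
from pathlib import Path


def checkpoint_commands(pid: int, images_dir: Path) -> list[list[str]]:
    """Commands to freeze a worker: GPU state first, then the host process."""
    return [
        # Step 1: move GPU state into the process's host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the process tree (now including the saved GPU state
        # held in host memory) to image files on disk.
        ["criu", "dump", "--tree", str(pid), "--images-dir", str(images_dir)],
    ]


def restore_commands(pid: int, images_dir: Path) -> list[list[str]]:
    """Commands to thaw a worker: the mirror image of the checkpoint order."""
    return [
        # Step 1: recreate the host process tree from the image files.
        ["criu", "restore", "--restore-detached", "--images-dir", str(images_dir)],
        # Step 2: move the saved GPU state back onto the device. This sketch
        # assumes the restored process keeps its original PID.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


# Hypothetical PID and directory, for illustration only.
worker_pid = 4242
snapshot_dir = Path("/var/lib/snapshots/worker-0")

for argv in checkpoint_commands(worker_pid, snapshot_dir):
    print(shlex.join(argv))
for argv in restore_commands(worker_pid, snapshot_dir):
    print(shlex.join(argv))
```

The ordering is the point of the sketch: GPU state has to land in host memory before the host-side dump so that one set of images covers both kinds of state, and on restore the process must exist again before its GPU state can be pushed back to the device.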
