NVIDIA's Nemotron 3 Ultra Runs on Half the Hardware at 5x Faster Speed

NVIDIA AI

Jun 01, 2026

2 min read

LLMS

long_context small_models

BENCHMARKS

NVIDIA just shipped its biggest open model yet, and the pitch is not about topping a leaderboard. Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, built for frontier reasoning and orchestration in agentic systems. The design goal is explicit: not winning single-turn chatbot comparisons, but orchestrating agents that plan, call tools, delegate to sub-agents, read observations, and recover from errors across hundreds of turns, while burning fewer tokens than comparable open models.
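To make the "550B total, 55B active" distinction concrete, here is a minimal sketch of generic top-k expert routing. The two parameter counts come from the article. The expert count, the value of k, and the router logits are hypothetical, and this is not NVIDIA's actual routing scheme.

```python
import math
import random

TOTAL_PARAMS_B = 550   # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 55   # parameters active per token in billions (from the article)


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that run for any single token."""
    return active_b / total_b


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating, shown only to illustrate why most experts stay idle
    for a given token. Not the model's real router.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"active fraction per token: {frac:.0%}")

    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(16)]  # 16 experts: hypothetical
    for expert, weight in top_k_route(logits, k=2):    # k=2: hypothetical
        print(f"expert {expert} gets weight {weight:.3f}")
```

Only the selected experts do any work for a token, which is how a model this large can have per-token compute closer to that of a 55B dense model.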

Nemotron 3 Ultra was announced at Computex 2026 and released on June 4. NVIDIA released the base model weights, post-trained checkpoints, reward models, NVFP4 quantized variants, training recipes, and datasets under the OpenMDW-1.1 license, a permissive open AI model license from the Linux Foundation. That makes it one of the most complete open releases from a major AI lab in recent memory.

The Biggest Open US Model, But Not the Biggest Overall

Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index. That puts it well ahead of the next-strongest US open-weights models, Gemma 4 31B (39.2), Nemotron 3 Super (36.0), and gpt-oss-120b (33.3), but behind the Chinese-led open-weights frontier, where Kimi K2.6 scores 53.9. The 6.2-point gap with China is real, but the efficiency story is where NVIDIA makes its case.

Served through BlackBox AI ahead of release, Nemotron 3 Ultra generates over 400 output tokens per second. That is three to six times faster than comparably sized open models from other labs. NVIDIA claims 5x faster inference and a 30% cost reduction for agentic tasks. The combination of a low active parameter count, NVFP4 quantization, and architectural innovations is what makes those numbers possible.
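A quick back-of-the-envelope sketch shows what those figures could mean for a long agent run. The throughput, speedup, and cost numbers are the ones quoted above. The workload (200 turns of 500 output tokens each) and the idea of a baseline running at exactly one fifth of the speed are assumptions made purely for illustration.

```python
REPORTED_TPS = 400.0      # "over 400 output tokens per second" (article)
CLAIMED_SPEEDUP = 5.0     # NVIDIA's claimed inference speedup (article)
CLAIMED_COST_CUT = 0.30   # claimed cost reduction for agentic tasks (article)

# Hypothetical agent workload, not from the article.
TURNS = 200
TOKENS_PER_TURN = 500


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on decoding, ignoring prefill and tool latency."""
    return output_tokens / tokens_per_second


def relative_cost(cost_cut: float) -> float:
    """Cost as a fraction of the baseline after applying a fractional cut."""
    return 1.0 - cost_cut


total_tokens = TURNS * TOKENS_PER_TURN
fast = generation_seconds(total_tokens, REPORTED_TPS)
# Assumed baseline: a model decoding at 1/5 of the reported speed.
slow = generation_seconds(total_tokens, REPORTED_TPS / CLAIMED_SPEEDUP)

print(f"{total_tokens} output tokens at {REPORTED_TPS:.0f} tok/s: {fast:.0f} s")  # 250 s
print(f"same run on the assumed 5x slower baseline: {slow:.0f} s")               # 1250 s
print(f"cost relative to baseline: {relative_cost(CLAIMED_COST_CUT):.0%}")
```

Under these assumptions the decode time for the whole run drops from roughly 21 minutes to about 4, which matters more for multi-hundred-turn agents than for single-turn chat.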

Five Architectural Bets Working Together

Architectural innovations include hybrid Mamba-Transformer layers for efficient long-context handling, NVFP4 quantization for cross-architecture GPU deployment with up to 5x higher throughput, LatentMoE for expert routing, and multi-token prediction for improved generative speed in multi-turn tasks. Each of these is worth unpacking:

Hybrid Mamba-Transformer: Mamba layers (a type of state-space model whose cost grows linearly with sequence length, rather than quadratically as attention's does) handle long-context workloads cheaply, while Transformer attention layers are kept where precise recall matters. This is what makes the 1M-token context window practical without exploding cost.
LatentMoE:
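As a rough illustration of the hybrid Mamba-Transformer trade-off described in the first bullet, the sketch below contrasts a state-space style recurrence with causal attention on a toy scalar sequence. It is heavily simplified. Real Mamba layers use input-dependent (selective) parameters and vector states, whereas the fixed coefficients `a`, `b`, `c` here, and the choice of q = k = v = x in the attention function, are assumptions for illustration only.

```python
import math


def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Fixed-coefficient scalar state-space recurrence.

    State is a single number regardless of sequence length, so memory is O(1)
    and time is O(n).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the whole history is folded into one running state
        ys.append(c * h)
    return ys


def causal_attention(xs):
    """Scalar causal softmax attention with q = k = v = x.

    Every step re-reads the full cache of past tokens, so memory is O(n) and
    time is O(n^2), but any earlier token can be recalled exactly.
    """
    cache = []
    ys = []
    for x in xs:
        cache.append(x)
        scores = [x * k for k in cache]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        ys.append(sum(w * v for w, v in zip(weights, cache)) / total)
    return ys


if __name__ == "__main__":
    seq = [1.0, 0.0, 0.0, 2.0]
    print("ssm:      ", ssm_scan(seq, a=0.5))
    print("attention:", causal_attention(seq))
```

The recurrence compresses the past into a fixed-size state, which is cheap but lossy. Attention keeps everything, which is exact but expensive. A hybrid stack spends the expensive layers only where that exact recall pays off.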
