NVIDIA's Nemotron 3 Ultra Runs 5x Faster Than Open Rivals at 550B Parameters

NVIDIA AI

Jun 04, 2026

2 min read

The open-weight frontier just got a serious upgrade. NVIDIA has shipped Nemotron 3 Ultra, a 550-billion-parameter model built from the ground up for one specific problem: making long-running AI agents actually practical to deploy at scale. The model is fully open -- weights, training data, synthetic datasets, and post-training recipes are all available now on Hugging Face.

The timing matters. The two models currently at the top of the frontier are proprietary, API-only, and priced accordingly. The closest open-weights competitor requires roughly 862 GB of VRAM to run, which in practice means a dedicated GPU cluster. Nemotron 3 Ultra is NVIDIA's answer to both constraints: intelligence approaching the frontier, open weights, and an architecture engineered for throughput rather than just accuracy.

The agent problem nobody talks about

Single-turn chat is easy. Agents are not. In multi-agent workflows, token counts grow quickly. Agents plan, call tools, invoke sub-agents, receive information, and then pass history, outputs, and reasoning steps back into the model continuously. Because each step feeds the accumulated context back in, longer tasks raise both cost and the risk of goal drift.
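To make the growth concrete, here is a minimal sketch of how input tokens accumulate when an agent re-sends its full history on every turn. The function name and the fixed 1,000 tokens added per turn are illustrative assumptions, not figures from NVIDIA.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens processed when every turn re-sends the full history.

    Assumes (hypothetically) that each turn appends a fixed number of new
    tokens: plans, tool outputs, and reasoning steps.
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # new material appended to the context
        total += history            # the model re-reads the whole history
    return total


if __name__ == "__main__":
    for n in (10, 100):
        print(n, cumulative_input_tokens(n, 1_000))
```

Under these assumptions, 10 turns process 55,000 input tokens and 100 turns process 5,050,000: ten times the turns costs roughly ninety times the tokens, which is why long-running agents are sensitive to per-token efficiency.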

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, built for frontier reasoning and orchestration in agentic systems. Within any agent workflow, most calls are routine, but a critical subset demands deeper reasoning. The model is designed to handle those hard calls: sustaining architectural decisions across long coding sessions, synthesizing contradictory evidence across hundreds of research sources, or verifying chip designs across thousands of constraints.
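As a rough illustration of what "55B active out of 550B" means, the sketch below shows generic top-k expert routing, the usual mechanism behind Mixture-of-Experts models. The expert count, the value of k, and the function names are hypothetical; the article does not describe Nemotron 3 Ultra's actual router.

```python
import math


def top_k_route(gate_logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating, not NVIDIA's published routing scheme.
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token."""
    return active_params_b / total_params_b


if __name__ == "__main__":
    # Hypothetical gate scores for 4 experts; only 2 are used for this token.
    print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
    print(active_fraction(550, 55))
```

With the figures from the article, only about 10% of the parameters participate in any single token's forward pass, which is where the throughput headroom of a sparse model comes from.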

Speed and cost that actually move the needle

Nemotron 3 Ultra achieves 5x higher throughput than other open models in its class, enabling long-running agents to complete tasks faster and more efficiently. That speed advantage compounds over time -- an agent that can run more reasoning cycles within the same time budget simply gets more done.

In experiments on SWE-bench and Terminal Bench 2.0, the model completed the benchmarks using fewer total tokens and fewer tokens per turn than comparable models. This lowers the cost of agentic tasks by up to 30%. That efficiency gain comes from the model being smarter about when to reason deeply versus when to answer directly -- a behavior baked in through post-training, not just architectural tricks.
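A back-of-the-envelope sketch shows how the throughput and token-efficiency claims combine. The baseline token count, baseline throughput, and price below are hypothetical placeholders, and the sketch assumes the "up to 30%" saving comes entirely from using fewer tokens at an unchanged per-token price.

```python
def task_time_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate a task's tokens at a given throughput."""
    return total_tokens / tokens_per_second


def task_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a task at a flat per-token price."""
    return total_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Hypothetical baseline: 2M tokens per task, 100 tokens/s, $1.00 per 1M tokens.
    baseline_tokens = 2_000_000
    baseline_tps = 100.0
    price = 1.00

    # Claims from the article: 5x throughput, up to 30% lower cost.
    faster_tps = baseline_tps * 5
    reduced_tokens = baseline_tokens * 70 // 100  # best case: 30% fewer tokens

    print(task_time_seconds(baseline_tokens, baseline_tps), task_cost(baseline_tokens, price))
    print(task_time_seconds(reduced_tokens, faster_tps), task_cost(reduced_tokens, price))
```

In this best-case scenario the two effects multiply on time (fewer tokens, each produced faster) but only the token reduction affects cost. Real workloads will vary with hardware, batch size, and prompt length.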
