NVIDIA's Nemotron 3 Ultra Runs Up to 8x Faster Than Rival Open Models

NVIDIA AI

1D AGO

2 min read

LLMS

long_context small_models

OPEN_SOURCE

NVIDIA just dropped its biggest open model yet, and the pitch is unusually specific: not a chatbot, not a benchmark chaser, but a reasoning engine built to run inside long-horizon agent pipelines without burning your GPU budget. Nemotron 3 Ultra is a 550-billion-parameter Mixture-of-Experts model with 55 billion parameters active per token. It was announced at Jensen Huang's Computex keynote and released as fully open weights shortly after.

The problem it's solving

Modern agent workflows are expensive in a very specific way. Single-turn chatbots are evolving into long-running agents that reason, maintain context, use tools, and run efficiently across many turns. Every planning step, tool call, sub-agent invocation, and observation gets fed back into the model, and token counts compound fast. Within any agent workflow, most calls are routine, but a critical subset demands deeper reasoning: sustaining architectural decisions across coding sessions, synthesizing contradictory evidence across hundreds of research sources, or verifying chip designs across thousands of constraints.

NVIDIA did not build this to win single-turn chatbot comparisons. It built it to orchestrate agents that plan, call tools, delegate to sub-agents, read observations, and recover from errors across hundreds of turns, while burning fewer tokens than comparable open models.
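
To make the compounding concrete, here is a minimal sketch of how input tokens grow when an agent re-sends its full history on every turn. The function name and all the numbers (a 2,000-token starting context, 500 tokens appended per turn) are illustrative assumptions, not figures from NVIDIA, and the model ignores prompt caching and context truncation.

```python
def cumulative_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Total input tokens read over an agent run that re-sends its full history each turn."""
    total = 0
    context = base_context  # system prompt + task description (assumed size)
    for _ in range(turns):
        total += context           # the model reads the whole context on this turn
        context += added_per_turn  # tool output and model reply are appended
    return total


if __name__ == "__main__":
    for turns in (10, 100):
        tokens = cumulative_input_tokens(turns, base_context=2_000, added_per_turn=500)
        print(f"{turns} turns: {tokens:,} input tokens")
```

Under these assumptions, 10 turns read 42,500 input tokens and 100 turns read 2,675,000: ten times the turns, roughly 63 times the tokens. That roughly quadratic growth is why per-token efficiency matters more for agents than for single-turn chat.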

Speed is the headline number

Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index. That puts it well ahead of the next-strongest US open-weights models (Gemma 4 31B at 39.2, Nemotron 3 Super at 36.0, and gpt-oss-120b at 33.3) but behind the Chinese-led open-weights frontier (Kimi K2.6 at 53.9). The intelligence gap versus Kimi is real, but the throughput story is where Ultra pulls away from the pack.

Nemotron 3 Ultra was measured at over 400 output tokens per second on a pre-release deployment on BlackBox AI. Comparably sized models from China-based labs such as DeepSeek and Moonshot (Kimi) are generally served at 50-100 tokens per second today. That 4-8x speed advantage is not a benchmark artifact; it changes what is practically possible in an interactive agent loop.
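
The 4-8x figure follows directly from the quoted throughputs, and a short sketch shows what it means in wall-clock terms. The 200,000-token run size is a hypothetical chosen for illustration. The calculation covers only output generation, ignoring prompt processing, tool latency, and batching effects, and it treats "over 400" as exactly 400.

```python
ULTRA_TPS = 400.0               # "over 400" output tokens/s, taken as a lower bound
PEER_TPS_RANGE = (50.0, 100.0)  # typical serving speeds quoted for peer models
RUN_OUTPUT_TOKENS = 200_000     # hypothetical output volume for one long agent run


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two output throughputs."""
    return fast_tps / slow_tps


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on generating output tokens."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    for peer_tps in PEER_TPS_RANGE:
        ratio = speedup(ULTRA_TPS, peer_tps)
        peer_minutes = generation_seconds(RUN_OUTPUT_TOKENS, peer_tps) / 60
        print(f"peer at {peer_tps:.0f} tok/s: {ratio:.0f}x slower, {peer_minutes:.1f} min")
    ultra_minutes = generation_seconds(RUN_OUTPUT_TOKENS, ULTRA_TPS) / 60
    print(f"Ultra at {ULTRA_TPS:.0f} tok/s: {ultra_minutes:.1f} min")
```

For this hypothetical run, generation takes about 8 minutes at 400 tokens per second, versus about 33 minutes at 100 and about 67 minutes at 50.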
