NVIDIA's Nemotron 3 Ultra Beats Trillion-Parameter Models at 5x the Speed

Artificial Analysis

Jun 04, 2026

2 min read

NVIDIA just dropped its biggest open-weights model yet: Nemotron 3 Ultra, a 550-billion-parameter reasoning model that sits at the top of the US open-weights intelligence rankings. The headline number is impressive, but the more interesting story is how it gets there -- through an architectural bet that most labs haven't made at this scale.

A model built for agents, not just answers

What makes it notable isn't just the size -- it's the workload it targets: long-running agents that plan, call tools, and reason across many turns. The model was optimized to run multi-step agentic tasks reliably, reason across long contexts, and call external tools accurately. As agents run longer, token counts grow and inference cost climbs. Nemotron 3 Ultra is designed to keep accuracy high while making that inference faster and cheaper.
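To make the cost growth concrete, here is a back-of-the-envelope sketch of how many prompt tokens an agent reads over a run when every step re-sends the full history. The function name and the token counts (`system_tokens`, `tokens_per_step`) are illustrative assumptions, not figures from NVIDIA, and the sketch ignores prefix caching, which would reduce the re-read cost in practice.

```python
def cumulative_prompt_tokens(steps, system_tokens=2000, tokens_per_step=1500):
    """Total prompt tokens read across an agent run.

    Assumes each step re-reads the entire context so far (no prefix
    caching). The default sizes are made-up illustrative numbers.
    """
    total = 0
    context = system_tokens
    for _ in range(steps):
        total += context             # the model reads the whole context this step
        context += tokens_per_step   # tool call, tool result, and reasoning are appended
    return total


if __name__ == "__main__":
    for steps in (10, 20, 50):
        print(steps, "steps ->", cumulative_prompt_tokens(steps), "prompt tokens")
```

Under these assumptions the total grows roughly with the square of the step count: going from 10 to 50 steps is 5x the steps but about 22x the tokens read, which is why long runs put so much pressure on inference cost.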

Smaller models (under 70B) often lose coherence in long agentic chains -- they forget earlier context, fail on compound instructions, or make tool-calling errors that cascade into broken workflows. The 550B parameter count is meant to give Nemotron 3 Ultra the capacity and reasoning depth to handle tasks that take 10, 20, or 50 steps without degrading.

The architecture is the real story

The model uses a hybrid Latent Mixture-of-Experts (LatentMoE) architecture that interleaves Mamba-2 and MoE layers with a small number of Attention layers. This is a meaningful departure from the standard Transformer-only design used by most frontier models. Here's what each piece does:

Mamba-2 layers: These handle long sequences with sub-quadratic scaling -- memory and compute grow much more slowly with context length than they do in pure attention.
Attention layers: A few Attention layers are kept for precise recall over large contexts.
LatentMoE routing: Tokens are projected into a smaller latent dimension for expert routing and computation, improving accuracy per byte.
MTP layers: The Ultra model adds Multi-Token Prediction (MTP) layers for faster text generation and improved quality. These enable native speculative decoding -- the model drafts several tokens at once, then verifies them, so each full verification pass can emit more than one token.
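
The draft-then-verify idea behind speculative decoding can be sketched in a few lines. This is a toy illustration, not NVIDIA's implementation: in Nemotron 3 Ultra the drafts come from the model's own MTP layers, whereas here a separate hypothetical function (`toy_draft`) stands in for them, and the acceptance rule is the simple greedy exact-match variant. All function names are assumptions made for the example.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, keep the verified prefix.

    Returns (tokens, verify_passes). A real system scores all k drafted
    positions in a single batched target pass; this toy simulates that
    pass position by position and counts it once.
    """
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft: cheaply propose k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) Verify: compare each draft with what the target would emit.
        verify_passes += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # replace the first wrong draft, stop
                break
            accepted.append(proposed)
        else:
            # Every draft matched: the same pass also yields one bonus token.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_passes


# Toy stand-ins: the "target" counts upward mod 10; the "draft" agrees
# except right after a 2, where it guesses wrong.
def toy_target(tokens):
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [0], 12)
    print(out == greedy_decode(toy_target, [0], 12), passes)
```

In this toy run the output is identical to plain greedy decoding, but it takes three verification passes instead of twelve target calls. The real-world speedup depends on how often the drafts are accepted and on how cheap drafting is relative to a full forward pass.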
