Perplexity Deploys NVIDIA's Nemotron Ultra to Run Agents 5x Faster

Perplexity

Jun 05, 2026

2 min read

LLMS

long_context

AGENTS

agent_frameworks computer_use


Perplexity has flipped the switch on Nemotron 3 Ultra, NVIDIA's largest open model to date, for its Pro and Max subscribers, both in the standard model picker and inside Perplexity Computer. The model is built around a very specific bet: that the next bottleneck in production AI is not single-turn chat quality but agents that run for hundreds of turns without losing the plot.

A 550B model that only fires 55B at a time

NVIDIA has released Nemotron 3 Ultra, an open Mixture-of-Experts hybrid Mamba-Transformer for long-running agents, with 550B total parameters of which 55B are active for any given token. The architecture matters here. By interleaving Mamba layers, whose cost scales linearly with sequence length, with traditional attention layers, NVIDIA gets a model that architecturally supports a one-million-token context window while avoiding much of the quadratic attention cost a pure transformer would incur.
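To put rough numbers on those two claims, here is a back-of-the-envelope sketch in Python. Only the 550B/55B parameter counts and the one-million-token window come from the announcement; the 8,000-token baseline and the pure power-law cost model are illustrative assumptions, not NVIDIA's figures, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic; not a benchmark of the real model.

TOTAL_PARAMS_B = 550    # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 55    # parameters active per token, in billions (from the article)
BASELINE_TOKENS = 8_000      # assumed short-context baseline (hypothetical)
LONG_TOKENS = 1_000_000      # one-million-token window (from the article)


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def relative_cost(seq_len: int, base_len: int, exponent: int) -> float:
    """How much a cost that grows as seq_len**exponent increases
    when the sequence grows from base_len to seq_len."""
    return (seq_len / base_len) ** exponent


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    linear = relative_cost(LONG_TOKENS, BASELINE_TOKENS, 1)      # Mamba-style layer
    quadratic = relative_cost(LONG_TOKENS, BASELINE_TOKENS, 2)   # full attention layer
    print(f"Active share of parameters: {frac:.0%}")
    print(f"Linear-cost layer, 8k -> 1M tokens: {linear:,.0f}x")
    print(f"Quadratic-cost layer, 8k -> 1M tokens: {quadratic:,.0f}x")
```

Under these assumptions, a layer with linear cost gets 125 times more expensive when moving from 8,000 to one million tokens, while a layer with quadratic cost gets 15,625 times more expensive. A hybrid keeps some attention layers, so its real cost falls somewhere between the two.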

The reason for that design choice is the problem NVIDIA is targeting. Multi-agent workflows cause token counts to grow quickly. Agents plan, call tools, invoke sub-agents, receive information, and then pass history, outputs, and reasoning steps back into the model continuously. As tasks run longer, this constant communication increases costs and the risk of goal drift. Ultra is the orchestrator in that picture, with smaller Nemotron 3 models handling routine execution.
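The token growth described above can be illustrated with a minimal simulation. It assumes a naive agent loop that re-sends the entire history on every turn and adds a fixed number of tokens per turn; the turn count, per-turn token figure, and function name are hypothetical, and real systems may cache or compress context.

```python
# Toy model of context growth in an agent loop; all numbers are illustrative.

def simulate_agent_loop(turns: int, tokens_per_turn: int, system_tokens: int = 0):
    """Return (final_context_tokens, cumulative_input_tokens).

    Each turn appends tokens_per_turn (tool output, sub-agent results,
    reasoning steps) to the history, and the model then reads the whole
    history again as input.
    """
    context = system_tokens
    cumulative = 0
    for _ in range(turns):
        context += tokens_per_turn   # history grows by this turn's material
        cumulative += context        # the full history is fed back in
    return context, cumulative


if __name__ == "__main__":
    final_context, total_read = simulate_agent_loop(turns=200, tokens_per_turn=2_000)
    print(f"Context after 200 turns: {final_context:,} tokens")
    print(f"Total input tokens processed: {total_read:,}")
```

With these assumed figures the context reaches 400,000 tokens after 200 turns, but the model has read about 40 million input tokens along the way, because the cumulative input grows with the square of the number of turns. That is the cost pressure a long context window and a cheaper-per-token architecture are meant to relieve.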
