
Perplexity has flipped the switch on Nemotron 3 Ultra, NVIDIA's largest open model to date, for its Pro and Max subscribers, both in the standard model picker and inside Perplexity Computer. The model is built around a very specific bet: that the next bottleneck in production AI is not single-turn chat quality but agents that run for hundreds of turns without losing the plot.
A 550B model that only fires 55B at a time
NVIDIA has released Nemotron 3 Ultra, an open Mixture-of-Experts model with 550B total parameters (55B active per token) and a hybrid Mamba-Transformer architecture, aimed at long-running agents. The architecture matters here. By interleaving Mamba layers, whose cost scales linearly with sequence length, with traditional attention layers, NVIDIA gets a model that can hold a one-million-token architectural context window without the quadratic growth in attention cost that a pure transformer would incur.
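To make the two numbers in that paragraph concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the parameter counts quoted above; the function names and the two sequence lengths are illustrative assumptions, and the scaling functions are a simplified cost model, not a description of how NVIDIA implements the layers.

```python
# Parameter counts as quoted in the article (in billions).
TOTAL_PARAMS_B = 550
ACTIVE_PARAMS_B = 55


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def attention_pairs(seq_len: int) -> int:
    """Token pairs scored by full self-attention: grows as n^2."""
    return seq_len * seq_len


def linear_steps(seq_len: int) -> int:
    """State updates in a linear-time sequence layer: grows as n."""
    return seq_len


def growth_ratio(cost_fn, short_len: int, long_len: int) -> float:
    """How many times more work the longer sequence needs."""
    return cost_fn(long_len) / cost_fn(short_len)


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"Active share per token: {share:.0%}")

    # Hypothetical comparison: a 10k-token prompt vs. a 1M-token context.
    short_len, long_len = 10_000, 1_000_000
    print("Quadratic growth:", growth_ratio(attention_pairs, short_len, long_len))
    print("Linear growth:   ", growth_ratio(linear_steps, short_len, long_len))
```

Under this simplified model, stretching the context a hundredfold multiplies the linear layers' work by a hundred but the full-attention work by ten thousand, which is the gap the hybrid design is trying to narrow.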
The reason for that design choice is the problem NVIDIA is targeting. Multi-agent workflows cause token counts to grow quickly. Agents plan, call tools, invoke sub-agents, receive information, and then pass history, outputs, and reasoning steps back into the model continuously. As tasks run longer, this constant communication increases costs and the risk of goal drift. Ultra is the orchestrator in that picture, with smaller Nemotron 3 models handling routine execution.
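The token growth described above can be sketched with a toy model. The Python snippet below assumes the simplest possible agent loop: every turn appends a fixed number of tokens to a shared history, and the whole history is fed back to the model on each turn with no caching, summarization, or truncation. The function name and the example numbers are hypothetical, chosen only to show the shape of the curve.

```python
def cumulative_input_tokens(
    turns: int,
    tokens_per_turn: int,
    system_prompt_tokens: int = 0,
) -> int:
    """Total tokens the model reads across an agent run.

    Assumes each turn appends `tokens_per_turn` to the history and the
    full history is re-read on every turn (no caching or compaction).
    """
    total_read = 0
    history = system_prompt_tokens
    for _ in range(turns):
        history += tokens_per_turn  # new plan, tool output, or sub-agent reply
        total_read += history       # model re-reads everything so far
    return total_read


if __name__ == "__main__":
    # Hypothetical run: 1,000 new tokens per turn.
    for turns in (100, 200, 400):
        total = cumulative_input_tokens(turns, tokens_per_turn=1_000)
        print(f"{turns:>4} turns -> {total:,} input tokens read")
```

In this toy setup the history itself grows linearly, but the total number of tokens read grows roughly with the square of the turn count, so doubling the length of a run about quadruples the input volume. Real systems soften this with caching and context compaction, which the sketch deliberately ignores.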
