
Building a robot that can reliably pick up an object sounds deceptively simple. In practice, it requires at least four separate systems: a vision model to parse the scene, a reasoning model to plan the move, a dynamics model to predict what happens next, and a policy model to output motor commands. The components are trained separately and joined with glue code, so errors compound at every handoff. NVIDIA's answer to this fragmentation is Cosmos 3, a single open model that handles all four jobs.
One model to rule the physical world
The biggest change from previous Cosmos releases is that Cosmos 3 is an omni-model. Previously, developers had to work with separate models for world generation (Cosmos Predict), controlled generation (Cosmos Transfer), scene understanding (Cosmos Reason), and policy generation (Cosmos Policy). Cosmos 3 folds all of these into a single model that can reason over and generate different modalities in one unified forward pass.
Cosmos 3 adopts a Mixture-of-Transformers (MoT) architecture that processes a unified sequence of tokens from different modalities. MoT is distinct from the more familiar Mixture-of-Experts (MoE) approach: rather than routing tokens to different expert sub-networks, MoT maintains two full sets of parameters at every transformer layer, one for reasoning and one for generation, and activates both during generation tasks.
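To make the MoT idea concrete, here is a toy sketch of one layer that keeps two separate parameter sets and picks between them by token role. It is not NVIDIA's implementation: the `Token` class, the `tower` tags, and the scale-and-bias "parameters" are hypothetical stand-ins for full transformer weights, and the sketch assumes tokens are assigned to a parameter set by a fixed tag rather than by a learned router.

```python
from dataclasses import dataclass


@dataclass
class Token:
    value: float
    tower: str  # hypothetical tag: "reasoner" or "generator"


# Two independent parameter sets for a single layer.
# A real model would hold full weight matrices here; a scale and
# a bias are enough to show that the two sets are kept separate.
LAYER_PARAMS = {
    "reasoner": {"scale": 2.0, "bias": 0.0},
    "generator": {"scale": 0.5, "bias": 1.0},
}


def mot_feed_forward(tokens):
    """Apply the parameter set that matches each token's tower tag.

    Assumption: the choice is a fixed lookup on the tag, not a learned
    routing decision as in Mixture-of-Experts.
    """
    out = []
    for tok in tokens:
        params = LAYER_PARAMS[tok.tower]
        new_value = tok.value * params["scale"] + params["bias"]
        out.append(Token(new_value, tok.tower))
    return out


sequence = [
    Token(1.0, "reasoner"),
    Token(3.0, "reasoner"),
    Token(4.0, "generator"),
]
result = mot_feed_forward(sequence)
print([t.value for t in result])  # [2.0, 6.0, 3.0]
```

All three tokens sit in the same sequence, yet the reasoner tokens and the generator token pass through different weights. That is the contrast with MoE the paragraph above draws.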
The two-tower trick
The architecture is organized around two towers that work in tandem:
- Reasoner tower: An autoregressive vision-language model (VLM) that interprets text, images, and video. It acts as the brain, understanding motion, object interactions, and physical context before any generation happens.
- Generator tower: A diffusion-based model that produces physics-aware video and action outputs, always conditioned on what the reasoner tower has understood.
In Reasoner Mode, language and visual understanding tokens are processed through causal self-attention, enabling next-token prediction for tasks such as perception, planning, and world reasoning. In Generator Mode, noisy image, video, audio, and action tokens are denoised through full attention, allowing the model to jointly generate coherent multimodal outputs.
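The two attention patterns can be pictured as a single mask over the unified sequence. The sketch below is an illustration under stated assumptions, not the published design: it assumes reasoner tokens come first in the sequence, that generator tokens may attend to every reasoner token (since generation is conditioned on the reasoner), and that reasoner tokens never attend to generator tokens. The function name `build_attention_mask` is hypothetical.

```python
def build_attention_mask(n_reasoner, n_generator):
    """Return mask[q][k] == True when query position q may attend to key k.

    Assumed layout: positions [0, n_reasoner) are reasoner tokens,
    followed by n_generator generator tokens.
    """
    n = n_reasoner + n_generator
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_reasoner:
                # Causal self-attention: a reasoner token sees only
                # itself and earlier tokens.
                mask[q][k] = k <= q
            else:
                # Full attention: a generator token sees all reasoner
                # tokens and all generator tokens.
                mask[q][k] = True
    return mask


mask = build_attention_mask(n_reasoner=3, n_generator=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With three reasoner tokens and two generator tokens, the printed grid shows a lower-triangular block for the reasoner rows and fully filled rows for the generator tokens.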
