Building a robot that can pick up a cup, navigate a warehouse, and explain what it sees used to require at least four different models stitched together with fragile glue code. NVIDIA Cosmos 3 is designed to remove that stitching. Announced at GTC Taipei, it is billed as the world's first fully open omni-model for physical AI, collapsing vision reasoning, world generation, and action prediction into a single unified architecture.

One model to rule the physical world

The biggest change in Cosmos 3 compared to previous Cosmos releases is that it is an omni-model, built on a Mixture-of-Transformers (MoT) architecture. Previously, developers had to work with separate models for different capabilities: world generation (Cosmos Predict), controlled generation (Cosmos Transfer), scene understanding (Cosmos Reason), and policy generation (Cosmos Policy). That fragmentation is now gone.

Cosmos 3 natively understands and generates text, images, video, ambient sound, and actions. NVIDIA claims leading physics accuracy and says the model cuts physical AI training and evaluation cycles from months to days. That last claim is the headline buried in the spec sheet: iteration that used to take a team months is meant to fit into a single development loop.

The architecture: a brain and a body in one model

The core innovation is a two-tower Mixture-of-Transformers (MoT) design. MoT is a variant of the standard transformer architecture in which separate groups of parameters handle different kinds of tokens, so only the relevant group is applied to each token rather than every parameter being applied everywhere. That makes it possible to handle wildly different modalities in one model without a matching blow-up in compute cost.
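
To make the idea of selectively used parameter groups concrete, here is a minimal sketch in plain Python. It is a toy, not Cosmos 3 code: the class name `ModalityRoutedLayer`, the group names, and the single linear layer per group are all assumptions made for illustration. Each token carries a label, and only the weight matrix for that label is applied to it.

```python
import random

def make_linear(rng, d_in, d_out):
    """Return a d_out x d_in weight matrix with small random entries."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_out)]

def apply_linear(weights, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

class ModalityRoutedLayer:
    """Toy layer (hypothetical): one weight matrix per group, chosen per token."""

    def __init__(self, groups, dim, seed=0):
        rng = random.Random(seed)
        self.weights = {g: make_linear(rng, dim, dim) for g in groups}
        self.calls = {g: 0 for g in groups}  # how often each group was used

    def forward(self, tokens):
        """tokens: list of (group_name, vector). Returns one output per token."""
        outputs = []
        for group, vec in tokens:
            self.calls[group] += 1  # only this group's parameters are touched
            outputs.append(apply_linear(self.weights[group], vec))
        return outputs

layer = ModalityRoutedLayer(groups=["text", "video", "action"], dim=4)
sequence = [
    ("text", [1.0, 0.0, 0.0, 0.0]),
    ("video", [0.0, 1.0, 0.0, 0.0]),
    ("video", [0.0, 0.0, 1.0, 0.0]),
]
outputs = layer.forward(sequence)
print(layer.calls)  # {'text': 1, 'video': 2, 'action': 0}
```

The point of the sketch is the cost profile: the layer holds three sets of weights, but each token pays for only one matrix-vector product, and a group that never appears in the sequence is never evaluated.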

  • Reasoner tower: A vision-language model (VLM) that interprets multimodal observations like images, videos, and text. It uses an autoregressive architecture to understand motion, object interactions, and other physical context, serving as the "brain" that reasons about the world before any generation happens.
  • Generator tower: A diffusion-based model that produces future observations and action sequences, such as physics-aware video and robot actions, conditioned on the reasoner tower's understanding.
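
The division of labor between the two towers can be sketched as a two-stage loop. This is a deliberately simplified illustration, not NVIDIA's implementation: the functions `reason` and `generate` are hypothetical, the "reasoner" is a running average rather than a language model, and the "generator" uses a fixed pull toward the context as a stand-in for a learned denoiser. What it does preserve is the shape of the process: sequential, order-dependent reading first, then step-by-step refinement from noise conditioned on the result.

```python
import random

def reason(observations):
    """Toy 'reasoner': read observations one at a time, left to right.

    Each step updates a running context from the previous context and the
    current observation, mimicking sequential (autoregressive) processing.
    """
    context = [0.0] * len(observations[0])
    for obs in observations:
        context = [0.5 * c + 0.5 * o for c, o in zip(context, obs)]
    return context

def generate(context, steps=20, rate=0.3, seed=0):
    """Toy 'generator': start from random noise and refine it step by step.

    Each step moves the sample part of the way toward the context. A real
    diffusion model would use a learned denoiser here; this is a stand-in.
    """
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in context]
    for _ in range(steps):
        sample = [s + rate * (c - s) for s, c in zip(sample, context)]
    return sample

observations = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
context = reason(observations)
future = generate(context)
print(context)  # [0.25, 0.5, 0.0]
```

Reversing the order of `observations` changes the context, which is the toy equivalent of the reasoner caring about what happened first. The generator never sees the raw observations, only the context the reasoner produced.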

The two towers share attention across the same sequence but use separate parameter sets, which is what lets a single model switch seamlessly between acting as a VLM, a video generator, a forward dynamics model, or a robot policy without any architectural changes. Think of it as one model that can both
