NVIDIA's Cosmos 3 Merges Robot Reasoning and Video Into One Open Model

Artificial Analysis

Jun 01, 2026

2 min read

IMAGE

image_generation

VIDEO

video_generation world_models

NVIDIA just shipped something the robotics and physical AI world has been waiting for: a single open model that can reason about the physical world, generate physics-accurate video, and output robot action sequences, all without switching between separate pipelines. Cosmos 3 launched at GTC Taipei and immediately claimed the top open-weights spot on both the Artificial Analysis Text-to-Image and Image-to-Video leaderboards, beating out HiDream, Alibaba's Qwen Image Max, Black Forest Labs' FLUX.1 [dev], and Lightricks' LTX-2.

The problem it's solving

Consider a home robot instructed to clean a dining table after dinner. Under the current paradigm, the robot must stitch together a disjointed suite of models: a VLM to locate dishware and generate an executable plan, a VLA or WAM to generate action sequences, and a forward dynamics model to simulate and evaluate future states. This fragmented architecture is suboptimal and computationally wasteful.

The biggest change in Cosmos 3 compared to previous Cosmos releases is that it's an omni-model. Previously, developers had to work with separate models for different capabilities like world generation (Cosmos Predict), controlled generation (Cosmos Transfer), scene understanding (Cosmos Reason), and policy generation (Cosmos Policy). Cosmos 3 enables all of this in a single model that can reason and generate different modalities in one unified forward pass.

One architecture to rule them all

Cosmos 3 is an omnimodal world model built on a unified Mixture-of-Transformers (MoT) architecture that combines an autoregressive (AR) transformer for reasoning with a diffusion transformer (DM) for multimodal generation. The MoT design (think of it as two specialist sub-networks sharing the same backbone) is what makes this possible without a massive compute penalty.

Here's how the two towers divide the work:

Reasoner tower: A vision-language model that interprets multimodal observations like images, videos, and text. It uses an autoregressive architecture to understand motion, object interactions, and physical context, serving as the model's brain before any generation happens.
Generator tower: Uses a diffusion-based process to generate future observations and action sequences, producing physics-aware video and action outputs conditioned on the reasoner tower's understanding. The reasoner can be called independently, but generation always activates both towers, so the output is guided by the reasoner.
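
To make that division of labor concrete, here is a toy Python sketch of the two call paths. It is not the Cosmos 3 API: the class name, method names, and return types are all hypothetical, and plain strings stand in for real model outputs. The only behavior it mirrors from the description above is that reasoning can run on its own, while generation always runs the reasoner first and then the generator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Understanding:
    """Hypothetical stand-in for what the reasoner tower hands to the generator."""
    summary: str


@dataclass
class Generation:
    """Hypothetical stand-in for generated video frames and robot actions."""
    video_frames: List[str]
    actions: List[str]


class TwoTowerSketch:
    """Toy model of the call paths only; no real inference happens here."""

    def __init__(self) -> None:
        self.calls: List[str] = []  # which towers ran, in order

    def reason(self, observations: List[str], prompt: str) -> Understanding:
        # Reasoner-only path: interpret the inputs, generate nothing.
        self.calls.append("reasoner")
        return Understanding(
            summary=f"{prompt} | saw {len(observations)} observation(s)"
        )

    def generate(self, observations: List[str], prompt: str, steps: int = 3) -> Generation:
        # Generation path: the reasoner always runs first, then the generator
        # produces outputs conditioned on the reasoner's understanding.
        understanding = self.reason(observations, prompt)
        self.calls.append("generator")
        frames = [f"frame {i} given [{understanding.summary}]" for i in range(steps)]
        actions = [f"action {i}" for i in range(steps)]
        return Generation(video_frames=frames, actions=actions)


model = TwoTowerSketch()
model.reason(["table.jpg"], "what is on the table?")
print(model.calls)  # ['reasoner']

result = model.generate(["table.jpg"], "clear the table")
print(model.calls)  # ['reasoner', 'reasoner', 'generator']
```

The second printout shows the asymmetry: a reasoning call touches one tower, while a generation call touches both.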
