vLLM-Omni v0.22.0 Ships NVIDIA Cosmos 3 Support and Robot Serving

vLLM

Jun 08, 2026

2 min read

INFRA

inference_optimization model_serving

VIDEO

video_generation

Jun 08, 2026

INFRA

inference_optimization model_serving

VIDEO

video_generation

2 min read

vLLM-Omni v0.22.0 just landed, and it is the most ambitious release the project has shipped. What started as an extension of vLLM to handle multimodal inputs has grown into a full production serving stack for world models, robots, speech synthesis, and video diffusion , all under one OpenAI-compatible API. With 339 commits from 124 contributors (52 of them brand new), this release signals that the community around omnimodal serving is growing fast.

The world model moment

The headline feature is day-0 support for NVIDIA Cosmos 3, announced at COMPUTEX 2026. Cosmos 3 is a leaderboard-topping open physical AI foundation model built on a mixture-of-transformers architecture, and it is the world's first fully open omnimodel with native vision reasoning and multimodal generation across text, image, video, ambient sound, and action. That last word , action , is what makes it different from every other multimodal model in the ecosystem.

NVIDIA trained Cosmos 3 on 20 trillion tokens of multimodal data, including nearly a billion images and 400 million real and synthetic videos. The action data is what makes Cosmos different from a regular video generator , it is meant to model how machines move, not just how scenes look. Previous Cosmos releases separated world generation, physical understanding, and controlled scene generation into different models and workflows. This release unifies those capabilities with a Mixture-of-Transformers architecture built around two towers: a reasoner tower that is a VLM interpreting multimodal observations, and a generation tower that creates physically grounded outputs.

vLLM-Omni v0.22.0 ships full Cosmos 3 support across model execution, recipes, tests, and accuracy coverage , including base model support, sound generation, and the action modality. You can now serve a model that takes in video of a robot arm, reasons about what it sees, and outputs both a predicted future video and the joint angles to execute the task, all through a single API endpoint.

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

Full access to in-depth AI research breakdowns
Be the first to know what's trending before it hits mainstream
Daily curated papers, repos, and industry moves

vLLM-Omni v0.22.0 Ships NVIDIA Cosmos 3 Support and Robot Serving

Takeaways

The world model moment

Don't miss what's next in AI