The gap between AI that understands the world and AI that can act in it has been the defining bottleneck for embodied intelligence. The Qwen family of foundation models already gives strong perception and reasoning about the physical world , but seeing is not acting: the gap between vision-language understanding and physical control remains the central bottleneck. Alibaba's Qwen team just took a serious swing at closing it.

Alibaba has launched the Qwen Robot Suite, a set of three foundation models for robotics developed by its Tongyi Lab: Qwen-RobotNav for vision-language navigation, Qwen-RobotWorld as a video-based world model for prediction and simulation, and Qwen-RobotManip as a generalist vision-language-action (VLA) model. The suite has already entered pilot testing with selected Alibaba Cloud enterprise clients.

Three models, one stack

The suite splits robot intelligence into three interconnected layers. Qwen-RobotNav, a vision-language navigation model, is designed to help machines understand and move through physical spaces. It works in tandem with Qwen-RobotWorld, a video world model that lets robots predict and simulate how physical scenes will evolve before they take action. Then the physical execution is handled by Qwen-RobotManip, a generalist vision-language-action model built on the Qwen3.5-4B architecture.

Each model is independently deployable, but the real value is in the stack. Think of it as perception (Nav), imagination (World), and action (Manip) , three modules that together give a robot the ability to navigate a space, mentally simulate what will happen next, and then physically execute a task.

Qwen-RobotManip: alignment before scale

Qwen-RobotManip is a generalizable Vision-Language-Action (VLA) foundation model built upon Qwen-VL. It introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting.

The core insight is that alignment is a prerequisite for data scaling , not an independent engineering choice. When demonstrations from different robots arrive with incompatible action representations, adding more data produces interference rather than synergy. The team's solution has three parts:

  • Canonical state-action representation: An 80-dimensional vector that accommodates joint positions, end-effector poses, gripper state, and dexterous hand joints across any robot morphology, with zero-padded dimensions masked out of the training loss.
  • Camera-frame delta pose: End-effector actions are expressed as pose deltas in the camera's coordinate frame rather than the robot's base frame. This means physically similar motions look numerically similar regardless of which robot is doing them , a key enabler for cross-embodiment transfer.
Alpha Signal

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

  • Full access to in-depth AI research breakdowns
  • Be the first to know what's trending before it hits mainstream
  • Daily curated papers, repos, and industry moves