Alibaba's Qwen-VLA Controls Ten Robot Bodies by Swapping a Text Prompt

Tongyi Lab

May 29, 2026

2 min read

ROBOTICS

humanoid manipulation vla_models

LLMS


Robotics has long lived with a quiet inefficiency: every new arm, gripper, or humanoid platform tends to get its own bespoke policy network, its own action head, and its own training pipeline. Alibaba's Qwen team is now proposing the opposite approach with Qwen-VLA, a single vision-language-action model that handles manipulation, navigation, and trajectory prediction across more than ten distinct robot bodies, switching between them by editing a text prompt.

Qwen-VLA uses Qwen3.5-4B as its vision-language backbone, paired with a 1.15B-parameter DiT flow-matching action decoder. The system casts manipulation, navigation, and trajectory prediction into a shared action-and-trajectory prediction framework, so a single model can learn from heterogeneous embodied data across tasks, environments, and robot embodiments. Embodiment-aware prompt conditioning tells the model which robot it is driving, so no per-platform output heads are needed.

One brain, many bodies

The core idea is that the robot's body is just another piece of context the language model can read. To support multiple robot platforms, the team introduces embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. Switching from a WidowX single-arm setup to a bimanual ALOHA rig or a humanoid is, mechanically, a prompt swap rather than a model swap.
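To make the prompt-swap idea concrete, here is a minimal Python sketch of what embodiment-aware conditioning could look like on the input side. It is not Qwen-VLA's actual prompt format, which the article does not show: the `Embodiment` class, the `build_prompt` helper, the description strings, and the action dimensions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical text description of one robot platform."""
    name: str
    description: str   # what the body looks like
    control: str       # which control convention the actions follow
    action_dim: int    # illustrative value, not taken from the article


# Illustrative registry; the real descriptions used by Qwen-VLA are not public here.
EMBODIMENTS = {
    "widowx": Embodiment(
        name="WidowX",
        description="single-arm manipulator with a parallel gripper",
        control="end-effector delta pose plus gripper command",
        action_dim=7,
    ),
    "aloha": Embodiment(
        name="ALOHA",
        description="bimanual rig with two arms and two grippers",
        control="absolute joint positions for both arms",
        action_dim=14,
    ),
}


def build_prompt(embodiment_key: str, instruction: str) -> str:
    """Compose the text the backbone would read: robot body first, then the task."""
    e = EMBODIMENTS[embodiment_key]
    return (
        f"Embodiment: {e.name}, a {e.description}. "
        f"Control: {e.control} ({e.action_dim} dimensions). "
        f"Task: {instruction}"
    )


if __name__ == "__main__":
    task = "put the red block in the bowl"
    # Same model, same task; only the embodiment text changes.
    print(build_prompt("widowx", task))
    print(build_prompt("aloha", task))
```

In this sketch the model weights never change between the two calls. Everything robot-specific lives in the string, which is the property the article attributes to Qwen-VLA.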

That matters because specialist policies, trained per robot or per task family, have been the dominant paradigm in the field. According to the team, a single Qwen-VLA generalist matches or outperforms task-specific specialists fine-tuned independently for each benchmark, across multiple simulation and real-world evaluations. If that holds up, it pushes embodied intelligence from skill specialists toward generalist actors.

How the action decoder learns to act

The bridge between language tokens and continuous joint commands is a Diffusion Transformer (DiT) trained with flow matching, a technique that learns to transport noise into a target distribution along smooth velocity fields. In Qwen-VLA, that target is a continuous action or trajectory rather than an image.
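The flow-matching idea can be sketched in a few lines of plain Python. This toy assumes a straight-line path between a noise sample and the target action, which is one common flow-matching formulation. The article does not say which path or solver Qwen-VLA uses, and every function name below is hypothetical. A real decoder would replace `predict_velocity` with the DiT conditioned on the backbone's features.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight line from noise (t=0) to the target action (t=1)."""
    return [(1 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight-line path; constant in t for this path choice."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, noise, action, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)


def sample_action(predict_velocity, noise, steps=10):
    """Euler integration of the learned velocity field from t=0 toward t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(action):
    """Stand-in for a trained network when there is a single target action.

    For one fixed target, the ideal field points from x to the target,
    scaled by the time remaining. Only valid for t < 1.
    """
    def predict_velocity(x, t):
        return [(a - xi) / (1 - t) for a, xi in zip(action, x)]
    return predict_velocity


if __name__ == "__main__":
    random.seed(0)
    action = [0.2, -0.5, 1.0]                       # toy 3-D action
    noise = [random.gauss(0.0, 1.0) for _ in action]
    oracle = make_oracle_velocity(action)
    print(flow_matching_loss(oracle, noise, action, t=random.random()))
    print(sample_action(oracle, noise, steps=10))
```

Training would minimize `flow_matching_loss` over many sampled noise vectors, times, and demonstrated actions. Inference runs `sample_action` once per action chunk, starting from fresh noise.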
