
Serving a model that listens, reasons, and speaks in real time turns out to be a very different problem from serving a text LLM. The vLLM-Omni team and Ant Group's Super Computing Technology team just published a detailed breakdown of how they got Qwen3-Omni production-ready, and the numbers are striking: first audio in ~0.6 seconds instead of ~6, speech generated faster than real time, and 5.4x more throughput on the same hardware.
Three models pretending to be one
Qwen3-Omni is Alibaba Qwen's fully omnimodal model. It adopts a Thinker-Talker Mixture-of-Experts architecture that unifies perception and generation across text, images, audio, and video. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source state-of-the-art on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe.
Under the hood, it runs as three distinct stages with very different compute profiles:
- Thinker: the heavy multimodal reasoning engine. It ingests text, images, audio, and video, then produces text tokens and hidden states (rich internal representations of meaning).
- Talker: receives those hidden states and autoregressively predicts a multi-codebook sequence of discrete speech codec codes (compressed audio tokens), one frame at a time. To keep streaming latency ultra-low, an MTP (multi-token prediction) module outputs the residual codebooks for the current frame at each decoding step, after which the Code2Wav renderer incrementally synthesizes the corresponding waveform, so audio is generated frame by frame.
- Code2Wav: a neural vocoder that converts the codec codes into actual audio waveforms.
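To make the data flow between the three stages concrete, here is a minimal toy sketch in Python. It is not the Qwen3-Omni or vLLM-Omni implementation: the function names, `NUM_CODEBOOKS`, `SAMPLES_PER_FRAME`, the arithmetic standing in for model inference, and the one-frame-per-hidden-state mapping are all assumptions chosen for illustration. What it does show is the shape of the pipeline: Thinker returns tokens plus hidden states, Talker yields one multi-codebook frame at a time, and Code2Wav turns each frame into an audio chunk as soon as it arrives.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

NUM_CODEBOOKS = 4       # hypothetical codebook count, for illustration only
SAMPLES_PER_FRAME = 8   # hypothetical number of audio samples per codec frame
CODEBOOK_SIZE = 1024    # hypothetical codebook vocabulary size


@dataclass
class ThinkerOutput:
    text_tokens: List[str]
    hidden_states: List[List[float]]  # one toy vector per text token


def thinker(prompt: str) -> ThinkerOutput:
    """Stand-in for the reasoning stage: emits tokens and hidden states."""
    tokens = prompt.split()
    hidden = [[float(len(tok))] for tok in tokens]
    return ThinkerOutput(tokens, hidden)


def talker(hidden_states: Iterable[List[float]]) -> Iterator[List[int]]:
    """Stand-in for the speech stage: yields one multi-codebook frame per step."""
    for step, h in enumerate(hidden_states):
        # Toy replacement for the autoregressive prediction of the first codebook.
        first = int(h[0]) % CODEBOOK_SIZE
        # Toy replacement for the MTP module filling in the residual codebooks.
        residual = [(first + k + step) % CODEBOOK_SIZE
                    for k in range(1, NUM_CODEBOOKS)]
        yield [first] + residual


def code2wav(frames: Iterable[List[int]]) -> Iterator[List[float]]:
    """Stand-in for the vocoder: renders each frame into an audio chunk."""
    for frame in frames:
        level = sum(frame) / (CODEBOOK_SIZE * len(frame))
        yield [level] * SAMPLES_PER_FRAME


out = thinker("hello omni model")
# Generators are lazy, so each frame flows to the vocoder as it is produced.
chunks = list(code2wav(talker(out.hidden_states)))
print(len(chunks), "audio chunks of", len(chunks[0]), "samples each")
```

Because `talker` and `code2wav` are generators, the first audio chunk is available after a single Talker step rather than after the whole utterance, which is the property that frame-by-frame streaming relies on.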
The core serving challenge: these three stages hit completely different bottlenecks. Treating them as one loop forces the slowest sub-path to gate everything else.
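A back-of-the-envelope model shows why coupling the stages hurts. The sketch below uses made-up per-frame stage costs (the `STAGE_MS` values are assumptions, not measurements from the published breakdown) and compares a single serial loop, where every frame pays for all three stages back to back, with an idealized pipeline in which the stages run concurrently and only the slowest one sets the steady-state pace.

```python
# Hypothetical per-frame stage costs in milliseconds (illustrative only).
STAGE_MS = {"thinker": 50, "talker": 30, "code2wav": 10}


def serial_ms(stage_ms: dict, num_frames: int) -> int:
    """One loop runs every stage in turn, so each frame costs the full sum."""
    return num_frames * sum(stage_ms.values())


def pipelined_ms(stage_ms: dict, num_frames: int) -> int:
    """Idealized overlap: the first frame pays the full sum to fill the
    pipeline, then one frame completes per slowest-stage interval."""
    fill = sum(stage_ms.values())
    bottleneck = max(stage_ms.values())
    return fill + (num_frames - 1) * bottleneck


frames = 10
print("serial loop:", serial_ms(STAGE_MS, frames), "ms")
print("pipelined  :", pipelined_ms(STAGE_MS, frames), "ms")
```

With these assumed costs the serial loop needs 900 ms for ten frames and the idealized pipeline 540 ms. The model ignores batching, queueing, and transfer overhead, so it only illustrates the direction of the effect, not its real-world size.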

