Ant Group Pushes Qwen3-Omni to 5.4x Faster With 0.6s Audio Response

vLLM

6H AGO

2 min read

INFRA

inference_optimization model_serving

AUDIO

realtime_voice

Serving a model that listens, reasons, and speaks in real time turns out to be a very different problem from serving a text LLM. The vLLM-Omni team and Ant Group's Super Computing Technology team just published a detailed breakdown of how they got Qwen3-Omni production-ready, and the numbers are striking: first audio in ~0.6 seconds instead of ~6, speech generated faster than real time, and 5.4x more throughput on the same hardware.

Three models pretending to be one

Qwen3-Omni is Alibaba Qwen's fully omnimodal model. It adopts a Thinker-Talker Mixture-of-Experts architecture that unifies perception and generation across text, images, audio, and video. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source state-of-the-art on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe.

Under the hood, it runs as three distinct stages with very different compute profiles:

Thinker: the heavy multimodal reasoning engine. It ingests text, images, audio, and video, then produces text tokens and hidden states (rich internal representations of meaning).
Talker: receives those hidden states and autoregressively generates discrete speech codec codes (compressed audio tokens) as a multi-codebook sequence, one frame at a time. At each decoding step, an MTP (multi-token prediction) module outputs the residual codebooks for the current frame, and the Code2Wav renderer then incrementally synthesizes the corresponding waveform. This frame-by-frame streaming is what keeps latency ultra-low.
Code2Wav: a neural vocoder that converts the codec codes into actual audio waveforms.
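To make the data flow between the three stages concrete, here is a minimal toy sketch in Python. It is not Qwen3-Omni or vLLM-Omni code: the function bodies, the codebook count, the frame size, and the 1024-entry code range are all hypothetical stand-ins, chosen only to show hidden states going in, one multi-codebook frame coming out per decoding step, and audio being rendered chunk by chunk.

```python
from dataclasses import dataclass
from typing import Iterator, List

NUM_CODEBOOKS = 4      # hypothetical value; the real codebook count is not stated here
SAMPLES_PER_FRAME = 8  # hypothetical value, kept tiny for readability
CODE_RANGE = 1024      # hypothetical codebook size


@dataclass
class Frame:
    codes: List[int]  # one code per codebook for this frame


def thinker(prompt: str) -> List[int]:
    """Stand-in for the Thinker: returns fake 'hidden states' as integers."""
    return [ord(ch) % 97 for ch in prompt]


def talker(hidden_states: List[int]) -> Iterator[Frame]:
    """Stand-in for the Talker: yields one frame per decoding step."""
    for step, h in enumerate(hidden_states):
        # Toy autoregressive step: the first codebook entry for this frame.
        first_code = (h + step) % CODE_RANGE
        # Toy MTP-like step: fill in the residual codebooks for the same frame.
        residual = [(first_code * (k + 2)) % CODE_RANGE
                    for k in range(NUM_CODEBOOKS - 1)]
        yield Frame(codes=[first_code] + residual)


def code2wav(frame: Frame) -> List[float]:
    """Stand-in for Code2Wav: turns one frame of codes into a chunk of samples."""
    level = sum(frame.codes) / (CODE_RANGE * NUM_CODEBOOKS)
    return [level] * SAMPLES_PER_FRAME


def stream_speech(prompt: str) -> Iterator[List[float]]:
    """Each audio chunk is produced as soon as its frame exists."""
    for frame in talker(thinker(prompt)):
        yield code2wav(frame)


chunks = list(stream_speech("hello"))
```

Because `talker` is a generator, the first audio chunk is available after a single decoding step rather than after the whole utterance has been decoded, which is the property that streaming generation relies on.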

The core serving challenge: these three stages hit completely different bottlenecks. Treating them as one loop forces the slowest sub-path to gate everything else.
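One generic way to keep a slow stage from gating the others is to run each stage as its own worker and connect them with bounded queues. The sketch below illustrates that idea with Python's `asyncio`; it is an assumption-laden toy, not the vLLM-Omni design. The step functions, their sleep times, and the queue size are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, List

STOP = None  # sentinel that marks the end of the stream


async def stage(work: Callable[[Any], Awaitable[Any]],
                inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, process them, and push results to outbox."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # forward the sentinel downstream
            return
        await outbox.put(await work(item))


# Hypothetical per-stage work with different costs (sleep stands in for compute).
async def thinker_step(x: int) -> int:
    await asyncio.sleep(0.003)
    return x * 10


async def talker_step(x: int) -> int:
    await asyncio.sleep(0.001)
    return x + 1


async def code2wav_step(x: int) -> str:
    await asyncio.sleep(0.002)
    return f"audio[{x}]"


async def run_pipeline(requests: List[int]) -> List[str]:
    # Bounded queues give backpressure: a fast stage cannot run far ahead.
    q0, q1, q2, q3 = (asyncio.Queue(maxsize=4) for _ in range(4))
    workers = [
        asyncio.create_task(stage(thinker_step, q0, q1)),
        asyncio.create_task(stage(talker_step, q1, q2)),
        asyncio.create_task(stage(code2wav_step, q2, q3)),
    ]

    async def feed() -> None:
        for r in requests:
            await q0.put(r)
        await q0.put(STOP)

    feeder = asyncio.create_task(feed())
    results: List[str] = []
    while True:
        item = await q3.get()
        if item is STOP:
            break
        results.append(item)
    await asyncio.gather(feeder, *workers)
    return results


results = asyncio.run(run_pipeline([1, 2, 3]))
```

With this layout, the Thinker stand-in can start on the second request while the Talker and Code2Wav stand-ins are still working on the first, so each stage can be sized and scheduled around its own bottleneck.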
