Artificial Analysis' AA-WER Streaming Benchmark Reveals No Voice AI Model Wins Everywhere

EDITORIAL LEADERBOARD

Artificial Analysis

May 28, 2026

2 min read

AUDIO

realtime_voice speech_to_text

BENCHMARKS

Picking a streaming speech-to-text (STT) model for a voice agent used to mean consulting two separate leaderboards: one for accuracy, one for speed. Artificial Analysis has collapsed those into one with AA-WER Streaming, a new benchmark that reports Word Error Rate (WER) and transcription latency together as a paired measurement, designed specifically for the voice-agent use case. Twenty-five models were evaluated, and the results reveal a surprisingly competitive field in which no single model wins everywhere.

Why the old benchmarks weren't enough

Streaming STT is fundamentally different from batch transcription. Instead of sending a complete audio file and waiting for a result, a streaming model receives audio in real time, chunk by chunk, and emits transcripts continuously. In a turn-based voice agent architecture, where the user speaks, STT transcribes, an LLM generates a response, and TTS synthesizes audio, STT is the first stage in the response chain, so its latency adds directly to the time the user waits before hearing a reply.
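
As a rough illustration of why STT latency matters, the sketch below adds up the stages of a turn-based pipeline. It assumes the stages run strictly one after another with no overlap, which is a simplification, and every number in it is hypothetical rather than a benchmark result.

```python
def time_to_reply(stt_s: float, llm_s: float, tts_s: float) -> float:
    """Seconds the user waits, assuming the three stages run back to back."""
    return stt_s + llm_s + tts_s


# Hypothetical stage latencies in seconds (not taken from AA-WER Streaming).
LLM_S = 0.5
TTS_S = 0.25

fast_stt = time_to_reply(stt_s=0.25, llm_s=LLM_S, tts_s=TTS_S)
slow_stt = time_to_reply(stt_s=0.75, llm_s=LLM_S, tts_s=TTS_S)

# With everything else held fixed, the extra STT delay shows up
# one-for-one in the total wait.
extra_wait = slow_stt - fast_stt
print(f"fast STT: {fast_stt:.2f}s, slow STT: {slow_stt:.2f}s, extra: {extra_wait:.2f}s")
```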

The lowest-latency models are often the least accurate, and the most accurate models are often the slowest. Knowing where each provider sits on that curve, under real production conditions, is what determines which STT model fits a specific use case. AA-WER Streaming makes that tradeoff visible with a Pareto frontier: a line connecting the models that offer the best accuracy at each latency budget, meaning no other model is both faster and more accurate.
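
A Pareto frontier over latency and WER can be computed with a single sorted pass. The following is a minimal sketch, not the procedure Artificial Analysis uses; the `Model` type, the model names, and all values are hypothetical placeholders.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    latency_s: float  # lower is better
    wer: float        # lower is better


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models that no other model beats on both latency and WER."""
    frontier: List[Model] = []
    best_wer = float("inf")
    # Walk from fastest to slowest; a model stays only if it is more
    # accurate than every model that is at least as fast.
    for m in sorted(models, key=lambda m: (m.latency_s, m.wer)):
        if m.wer < best_wer:
            frontier.append(m)
            best_wer = m.wer
    return frontier


# Hypothetical data points, not benchmark results.
models = [
    Model("A", 0.3, 0.12),
    Model("B", 0.5, 0.09),
    Model("C", 0.6, 0.11),  # slower and less accurate than B
    Model("D", 0.9, 0.07),
    Model("E", 1.2, 0.08),  # slower and less accurate than D
]

frontier_names = [m.name for m in pareto_frontier(models)]
print(frontier_names)
```

In this toy data, C and E drop out because a faster model already achieves a lower WER; the remaining models each represent the best available accuracy at their latency budget.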

How the benchmark works

Artificial Analysis evaluates STT models on accuracy, speed, and price, benchmarking both offline (batch) and streaming transcription. The streaming benchmark runs on roughly 8 hours of audio across three datasets, weighted 50% AA-AgentTalk / 25% VoxPopuli / 25% Earnings22. AA-AgentTalk is a proprietary held-out set focused on voice-agent speech; VoxPopuli covers European parliamentary proceedings with diverse accents; Earnings22 contains corporate earnings calls with technical vocabulary and overlapping speakers.
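
To make the scoring concrete, the sketch below computes WER as word-level edit distance divided by the number of reference words, then combines per-dataset scores using the 50/25/25 weights. The article does not say how text is normalized or how the weights are applied, so the lowercase-and-split normalization and the weighted average of per-dataset WER are assumptions, and the per-dataset scores are hypothetical.

```python
from typing import Dict, List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum substitutions, deletions, and insertions to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate with a simple lowercase/whitespace normalization (assumed)."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


# Dataset weights as stated in the article.
WEIGHTS: Dict[str, float] = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli": 0.25,
    "Earnings22": 0.25,
}


def weighted_wer(per_dataset_wer: Dict[str, float]) -> float:
    """Weighted average of per-dataset WER (assumed aggregation method)."""
    return sum(WEIGHTS[name] * per_dataset_wer[name] for name in WEIGHTS)


# One deletion ("a") and one substitution ("two" -> "too") over 5 reference words.
example_wer = wer("book a table for two", "book table for too")

# Hypothetical per-dataset scores, not benchmark results.
overall = weighted_wer({"AA-AgentTalk": 0.08, "VoxPopuli": 0.10, "Earnings22": 0.14})
print(f"example WER: {example_wer:.2f}, weighted WER: {overall:.3f}")
```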
