Microsoft's MAI-Transcribe-1.5 Transcribes an Hour of Audio in 15 Seconds

Microsoft AI

1D AGO

1 min read

Microsoft's Superintelligence team has shipped MAI-Transcribe-1.5, a speech-to-text model that pushes the accuracy-speed Pareto frontier far enough to make most production transcription stacks worth re-evaluating. It posts a Word Error Rate of 2.4% on the Artificial Analysis leaderboard, good for the #3 position on that benchmark, while running fast enough to transcribe an hour of audio in under 15 seconds.

The accuracy-speed tradeoff, quietly broken

Transcription models have historically forced an awkward choice: high accuracy meant slow batch jobs, while fast streaming models gave up word-level precision. MAI-Transcribe-1.5 now leads on Accuracy x Speed on the Artificial Analysis leaderboard, running up to 5x faster than models of comparable accuracy. The impact shows up most on long audio, where the model can transcribe an hour of audio in under 15 seconds.
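To put the headline figure in perspective, the arithmetic can be sketched in a few lines of Python. The function names and the 1,000-hour archive below are hypothetical, chosen only for illustration. Since the article says "under 15 seconds", the resulting speedup is a lower bound, and the archive estimate assumes throughput scales linearly on a single stream, which the article does not claim.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def hours_to_transcribe(archive_hours: float, speedup: float) -> float:
    """Processing time in hours, assuming throughput scales linearly (an assumption)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return archive_hours / speedup


# One hour of audio (3600 s) in 15 s, the upper bound quoted in the article.
speedup = speedup_over_real_time(3600, 15)
print(speedup)  # 240.0

# Hypothetical 1,000-hour archive processed sequentially at that rate.
print(round(hours_to_transcribe(1000, speedup), 2))
```

In other words, the quoted figure corresponds to at least 240x real time on long audio.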

On FLEURS, the standard multilingual benchmark, the model achieves best-in-class Word Error Rate across 43 languages. Coverage grew from the 25 languages supported in the prior version, adding 18 new languages without compromising accuracy.
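For readers unfamiliar with the metric, Word Error Rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that textbook definition, not the scoring pipeline used by Artificial Analysis or FLEURS; real evaluations typically normalize case and punctuation first, which this sketch omits. The function name and sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```

By this definition, a WER of 2.4% means roughly 2.4 word-level errors per 100 reference words.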
