Microsoft's Superintelligence team has shipped MAI-Transcribe-1.5, a speech-to-text model that pushes the accuracy-speed Pareto frontier hard enough to make most production transcription stacks worth re-evaluating. It hits a Word Error Rate of 2.4% on the Artificial Analysis leaderboard, taking the #3 position in that benchmark, while running fast enough to turn an hour of audio into text in roughly the time it takes to refill a coffee.

The accuracy-speed tradeoff, quietly broken

Transcription models have historically forced an awkward choice: high accuracy meant slow batch jobs, while fast streaming models gave up word-level precision. MAI-Transcribe-1.5 now leads on Accuracy x Speed on the Artificial Analysis leaderboard, running up to 5x faster than models of comparable accuracy. The impact shows up most on long audio, where the model can transcribe an hour of audio in under 15 seconds.
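As a back-of-the-envelope check, the long-audio figure can be turned into a speed-over-real-time factor. This is a minimal sketch: the function name is illustrative and not part of any Microsoft API, and the only inputs are the two numbers quoted above. Because the claim is "under 15 seconds", the result is a lower bound.

```python
def speedup_over_realtime(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many times faster than real time a transcription ran.

    Hypothetical helper for illustration only; it does plain arithmetic
    on the figures quoted in the article.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# One hour of audio, processed in at most 15 seconds (the article's upper bound).
ONE_HOUR_SECONDS = 60 * 60
lower_bound = speedup_over_realtime(ONE_HOUR_SECONDS, 15)
print(f"at least {lower_bound:.0f}x faster than real time")  # at least 240x faster than real time
```

The inverse of this number (15 / 3600, about 0.0042) is what speech benchmarks often report as the real-time factor, where lower is better.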

On FLEURS, the standard multilingual benchmark, the model achieves best-in-class Word Error Rate across 43 languages. Coverage grew from the 25 languages supported in the prior version to 43, adding 18 new languages without compromising accuracy.
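For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic textbook implementation, not the scoring code used by Artificial Analysis or FLEURS; it assumes whitespace tokenization and applies no text normalization, whereas real benchmark pipelines typically normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level Levenshtein distance / reference length.

    Illustrative only: tokens are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Read this way, a WER of 2.4% corresponds to roughly 24 word-level errors per 1,000 reference words.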
