
The top of the text-to-speech leaderboard has a new occupant from China. Alibaba's Fun-Realtime-TTS just climbed to the #1 spot on the Artificial Analysis Speech Arena, edging out Google's Gemini 3.1 Flash TTS and Inworld's Realtime TTS-2 Research Preview in blind listener comparisons.
The Speech Arena is not a synthetic benchmark. Rankings come from blind user votes: listeners hear pairs of speech samples generated from the same text and pick the one that sounds more natural, and models are then ranked using an Elo system. Higher scores indicate speech that listeners prefer more often. That makes the leaderboard a reasonable proxy for which models actually sound human to real ears.
A photo finish at the top
The headline number is tight. Fun-Realtime-TTS posted an Elo of 1,219 (+16/-16) across 962 arena appearances, placing it ahead of Gemini 3.1 Flash TTS at 1,214, Inworld Realtime TTS-2 Research Preview at 1,209, and Cartesia Sonic 3.5 at 1,203. Just 24 Elo points separate the top five models, which in Elo terms means listeners are essentially split down the middle on many of these head-to-head comparisons.
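To put those gaps in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo formula with a 400-point scale; the source does not say which exact variant the Speech Arena uses, so treat the function name `expected_win_rate` and the `scale` parameter as illustrative assumptions rather than the leaderboard's actual method.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of votes for model A over model B.

    Assumption: standard logistic Elo with a 400-point scale. The arena's
    exact rating formula is not stated in the article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Elo scores as reported in the article.
ratings = {
    "Fun-Realtime-TTS": 1219,
    "Gemini 3.1 Flash TTS": 1214,
    "Inworld Realtime TTS-2 Research Preview": 1209,
    "Cartesia Sonic 3.5": 1203,
}

leader = "Fun-Realtime-TTS"
for name, rating in ratings.items():
    if name == leader:
        continue
    gap = ratings[leader] - rating
    rate = expected_win_rate(ratings[leader], rating)
    print(f"{leader} vs {name}: gap {gap} -> expected win rate {rate:.1%}")
```

Under these assumptions, even a 24-point gap works out to roughly a 53% expected preference rate, which is close to a coin flip. The 5-point lead over Gemini 3.1 Flash TTS is also smaller than the reported +16/-16 interval on the leader's own score.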
This is also a meaningful jump for Alibaba. The lab's previous Fun-Realtime-TTS-Preview reached #7 on the leaderboard, making this Alibaba's first #1 model in the Artificial Analysis Speech Arena. The preview version had only cracked the global top five a few weeks earlier with an Elo of 1,190, so this is a sharp climb in a short window.
