
Artificial Analysis, the independent AI benchmarking outfit behind the Speech Arena leaderboard, announced that Cartesia's Sonic-3.5 had climbed to the #1 position in its blind human evaluation rankings, surpassing both Google's Gemini 3.1 Flash TTS and Inworld's Realtime TTS 1.5 Max. The Speech Arena works like a taste test: real users listen to two unlabeled audio clips generated from the same text and vote for whichever sounds more natural. No brand loyalty, no benchmarks, just ears.
How the Elo math shook out
Elo is the same rating system used in chess: a model gains points when it wins a head-to-head comparison and loses points when it is beaten. Sonic-3.5 reached an Elo score of 1,218 based on 1,144 arena appearances, placing it ahead of Gemini 3.1 Flash TTS at 1,209 and Inworld Realtime TTS 1.5 Max at 1,194. The margin over Google was narrow, just 9 Elo points, which matters because these rankings shift as more votes come in. Indeed, the live leaderboard has since continued to move, with the top cluster remaining tightly contested.
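To put those gaps in perspective, here is a minimal sketch of the textbook Elo formulas applied to the scores quoted above. It assumes the standard chess-style logistic curve with a 400-point scale and a K-factor of 32. Both are illustrative assumptions: Artificial Analysis's exact rating procedure is not described here, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard Elo logistic model.

    The 400-point scale is the conventional chess value, assumed here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head vote.

    k (the K-factor) is an assumed value; it controls how far one vote moves a rating.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Scores quoted in the article.
SONIC, GEMINI, INWORLD = 1218, 1209, 1194

p_vs_gemini = expected_score(SONIC, GEMINI)
p_vs_inworld = expected_score(SONIC, INWORLD)

print(f"Sonic-3.5 vs Gemini 3.1 Flash TTS: {p_vs_gemini:.3f}")
print(f"Sonic-3.5 vs Inworld Realtime TTS 1.5 Max: {p_vs_inworld:.3f}")

# One hypothetical vote in which Gemini beats Sonic-3.5.
sonic_after, gemini_after = update(SONIC, GEMINI, a_won=False)
print(f"After one Gemini win: Sonic {sonic_after:.1f}, Gemini {gemini_after:.1f}")
```

Under these assumptions, a 9-point lead corresponds to an expected win rate of about 51% against Gemini, and the 24-point lead over Inworld to about 53%. In other words, the top three are close to a coin flip in any single comparison, which is why the ordering can change as votes accumulate.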
What Sonic-3.5 actually is
Sonic-3.5 is Cartesia's fastest, most natural text-to-speech model, with sub-90ms latency and native support for 42 languages. Cartesia is a research company that spun out of the Stanford AI Lab and built its technology on State Space Models (SSMs), a different architecture from the Transformer models that power most large language models. SSMs are significantly more efficient, which enables the sub-100ms latency the company is known for.
