
Voice AI has a naturalness problem. Full-duplex speech models -- ones that listen and speak at the same time -- are theoretically closer to human conversation than traditional turn-based systems, but in practice they still feel robotic. They go silent when they should respond, jump in at the wrong moment, and almost never say "yeah" or "uh-huh" while you're talking. Kyutai's new paper tackles all of this at once, using reinforcement learning to post-train full-duplex models to behave more like actual humans in conversation.
The gap between full-duplex and actually natural
To understand why this matters, it helps to know what full-duplex means. Most voice AI systems treat dialogue as a turn-based process where each participant produces a full sentence before the other responds -- a half-duplex approach. Full-duplex dialogue, by contrast, allows both sides to speak and listen simultaneously, just like a real phone call. Traditional cascaded systems (ASR to LLM to TTS) let you customize voice and role, but the resulting conversations feel robotic: awkward pauses, no interruptions, and unnatural turn-taking.
Kyutai's Moshi, introduced as a speech-text foundation model for real-time dialogue, generates speech as tokens from a neural audio codec while modeling its own speech and the user's speech as parallel streams -- removing the concept of explicit speaker turns entirely. NVIDIA's PersonaPlex is built on top of Moshi's architecture, fine-tuned from the Moshiko weights. Both models can theoretically handle the full richness of human conversation. In practice, they still fall short.
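To make the parallel-stream idea concrete, here is a minimal sketch in Python. It is a toy illustration and not Moshi's actual representation: the `Frame` type, the word-level tokens, and both helper functions are hypothetical names invented for this example. Each time step holds one slot for the user and one for the model, so overlapping speech is representable. Flattening the same timeline into turns forces a choice about who holds the floor, and the overlap is lost.

```python
from typing import List, Optional, Tuple

# Hypothetical toy format: one (user, model) pair per time step.
# None means that stream is silent during the step.
Frame = Tuple[Optional[str], Optional[str]]


def overlap_indices(frames: List[Frame]) -> List[int]:
    """Return the time steps where both streams are active at once."""
    return [
        i for i, (user, model) in enumerate(frames)
        if user is not None and model is not None
    ]


def to_half_duplex(frames: List[Frame]) -> List[Tuple[str, str]]:
    """Flatten the two streams into (speaker, text) turns.

    Assumption for this sketch: on overlap the user keeps the floor,
    so whatever the model said in that step is dropped.
    """
    turns: List[Tuple[str, str]] = []
    for user, model in frames:
        if user is not None:
            speaker, token = "user", user
        elif model is not None:
            speaker, token = "model", model
        else:
            continue  # both silent: nothing to record
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous step: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + token)
        else:
            turns.append((speaker, token))
    return turns


frames: List[Frame] = [
    ("so", None),
    ("I", None),
    ("was", "mm-hm"),   # model backchannels while the user is talking
    ("thinking", None),
    (None, None),
    (None, "go"),
    (None, "on"),
]

print(overlap_indices(frames))  # [2]
print(to_half_duplex(frames))   # [('user', 'so I was thinking'), ('model', 'go on')]
```

In the flattened view the "mm-hm" disappears, which is exactly the kind of backchannel a turn-based format cannot express.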
Why supervised learning alone isn't enough
