Kyutai Uses Reinforcement Learning to Make Moshi Sound Actually Human

kyutai

Jun 10, 2026

2 min read

Voice AI has a naturalness problem. Full-duplex speech models -- ones that listen and speak at the same time -- are theoretically closer to human conversation than traditional turn-based systems, but in practice they still feel robotic. They go silent when they should respond, jump in at the wrong moment, and almost never say "yeah" or "uh-huh" while you're talking. Kyutai's new paper tackles all of this at once, using reinforcement learning to post-train full-duplex models to behave more like actual humans in conversation.

The gap between full-duplex and actually natural

To understand why this matters, it helps to know what full-duplex means. Most voice AI systems treat dialogue as a turn-based process where each participant produces a full sentence before the other responds -- a half-duplex approach. Full-duplex dialogue, by contrast, allows both sides to speak and listen simultaneously, just like a real phone call. Traditional cascaded systems (ASR to LLM to TTS) let you customize voice and role, but conversations feel robotic, with awkward pauses, no interruptions, and unnatural turn-taking.

Kyutai's Moshi, introduced as a speech-text foundation model for real-time dialogue, generates speech as tokens from a neural audio codec while modeling its own speech and the user's speech as parallel streams -- removing the concept of explicit speaker turns entirely. NVIDIA's PersonaPlex is built on top of Moshi's architecture, fine-tuned from the Moshiko weights. Both models can theoretically handle the full richness of human conversation. In practice, they still fall short.
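To make the difference concrete, here is a toy Python sketch of the two layouts. It is not Moshi's code or data format: the Frame type, the "_" silence token, and the word-level tokens are all assumptions made for illustration (the real model works on neural audio codec tokens). It only shows that a turn-based layout cannot represent a "yeah" spoken over the user, while two parallel streams on a shared clock can.

```python
from dataclasses import dataclass

SILENCE = "_"  # hypothetical padding token for a frame with no speech


@dataclass
class Frame:
    user: str   # token on the user's stream at this time step
    model: str  # token on the model's stream at this time step


def half_duplex(turns):
    """Turn-based layout: only one speaker is active in any frame."""
    frames = []
    for speaker, tokens in turns:
        for tok in tokens:
            if speaker == "user":
                frames.append(Frame(user=tok, model=SILENCE))
            else:
                frames.append(Frame(user=SILENCE, model=tok))
    return frames


def full_duplex(user_stream, model_stream):
    """Parallel layout: both streams advance on the same clock."""
    n = max(len(user_stream), len(model_stream))
    padded_user = list(user_stream) + [SILENCE] * (n - len(user_stream))
    padded_model = list(model_stream) + [SILENCE] * (n - len(model_stream))
    return [Frame(u, m) for u, m in zip(padded_user, padded_model)]


def overlap_frames(frames):
    """Count frames where both sides are speaking at once."""
    return sum(1 for f in frames if f.user != SILENCE and f.model != SILENCE)


# Turn-based: the model's "yeah" can only come after the user has finished.
turn_based = half_duplex([("user", ["so", "I", "was", "thinking"]),
                          ("model", ["yeah"])])

# Parallel streams: the "yeah" lands while the user is still talking.
parallel = full_duplex(["so", "I", "was", "thinking"],
                       [SILENCE, SILENCE, "yeah", SILENCE])

print(overlap_frames(turn_based))  # 0
print(overlap_frames(parallel))    # 1
```

In the parallel layout there is no explicit turn marker at all. Whether the model speaks, stays silent, or backchannels is simply what appears on its stream at each frame.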

Why supervised learning alone isn't enough
