
Google just shipped Gemini 3.5 Live Translate, a dedicated audio model that translates spoken language into spoken language as the audio streams in. No waiting for a sentence to finish. No discrete request-response cycle. Just continuous speech in, continuous translated speech out, staying a few seconds behind the speaker throughout the session.
The pipeline problem it kills
Every real-time translation system before this worked roughly the same way: capture speech, run it through a speech-to-text model, feed the transcript to a translation model, then synthesize the result with a text-to-speech engine. Three separate models, three separate latency budgets, three places for errors to compound.
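To make the compounding concrete, here is a minimal sketch of a three-stage cascade. The stage latencies and accuracies are placeholder numbers chosen for illustration, not measurements of any real system, and `Stage` and `cascade_totals` are hypothetical names. The sketch also assumes that errors at each stage are independent, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical per-stage delay
    accuracy: float    # hypothetical chance the stage introduces no error


def cascade_totals(stages):
    """Return (total latency, end-to-end accuracy) for stages run back to back."""
    total_latency = sum(s.latency_ms for s in stages)
    end_to_end = 1.0
    for s in stages:
        # Assumes errors at each stage are independent of the others.
        end_to_end *= s.accuracy
    return total_latency, end_to_end


# Placeholder figures for illustration only.
PIPELINE = [
    Stage("speech-to-text", 300.0, 0.95),
    Stage("translation", 200.0, 0.95),
    Stage("text-to-speech", 250.0, 0.98),
]

if __name__ == "__main__":
    latency, accuracy = cascade_totals(PIPELINE)
    print(f"latency: {latency:.0f} ms, end-to-end accuracy: {accuracy:.3f}")
```

With these made-up figures the delays add up to 750 ms, and the end-to-end accuracy falls below that of the weakest single stage, which is the compounding effect described above.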
Gemini 3.5 Live Translate departs from that design. Where a traditional pipeline chains a streaming speech-to-text model, a machine-translation model, and a separate text-to-speech model, with each stage adding latency and accumulating errors, Gemini 3.5 Live Translate folds these steps into one audio model. The trade-off is real, though: the output is permanent audio, not editable text. Once a word is spoken, it cannot be revised mid-utterance.
Gemini is a natively multimodal model, meaning it was trained on text, images, code, and audio simultaneously rather than being a language model with audio features bolted on. This matters for translation because speech carries information that text doesn't: emphasis, hesitation, pace, and tone. Gemini 3.5 Live Translate is based on Gemini 3 Pro.
What it actually does
The model automatically detects 70+ languages and generates smooth, natural-sounding translated speech that preserves the speakers' intonation, pacing and pitch. Unlike turn-by-turn systems that wait for the speaker to finish speaking before responding, 3.5 Live Translate generates speech continuously, balancing the trade-off between waiting for context to improve quality and translating immediately to stay in sync with the speaker. It delivers fluid audio without awkward pauses and stays just a few seconds behind the speaker throughout the session.
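The wait-versus-translate trade-off can be illustrated with a toy fixed-lag policy, similar in spirit to the "wait-k" policies studied in simultaneous-translation research. This is not a description of how Gemini 3.5 Live Translate decides when to speak, which the announcement does not spell out. The sketch assumes one output chunk per source chunk, and `emit_schedule` and `average_delay` are hypothetical helper names.

```python
def emit_schedule(num_source_chunks, lag):
    """For each output chunk, return how many source chunks have been heard
    by the time it is emitted.

    Toy fixed-lag policy: output chunk i (0-indexed) is emitted once
    i + lag source chunks have arrived, capped at the end of the input.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1 chunk")
    return [min(i + lag, num_source_chunks) for i in range(num_source_chunks)]


def average_delay(schedule):
    """Mean number of chunks by which the output trails the source."""
    return sum(heard - i for i, heard in enumerate(schedule)) / len(schedule)


if __name__ == "__main__":
    for lag in (1, 3):
        schedule = emit_schedule(6, lag)
        print(f"lag={lag}: heard-before-emit={schedule}, "
              f"average delay={average_delay(schedule)} chunks")
```

A larger lag gives every output chunk more source context to work with, at the cost of trailing further behind the speaker. A lag of 1 stays closest to the speaker but commits to each chunk with the least context.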
