
Real-time speech translation has been a solved problem on paper for years. In practice, it has always felt like a game of telephone: your voice gets transcribed to text, that text gets machine-translated, and the result gets synthesized back into speech. Every handoff adds latency and a new surface for errors to compound. Gemini 3.5 Live Translate collapses that three-step cascade into a single end-to-end audio model, and the difference is audible.
The pipeline problem it actually solves
Older real-time translation systems, including Google Meet's previous implementation, ran a cascaded three-step pipeline: transcribe speech to text (STT), translate that text, then synthesize the translation back to speech (TTS). Each hop adds latency and a place for errors to compound: a mistranscription becomes a mistranslation, which becomes a confidently wrong spoken sentence.
Three architectural choices distinguish Gemini 3.5 Live Translate from earlier streaming-translation systems. The most fundamental is that, instead of chaining separate streaming speech-to-text, machine-translation, and text-to-speech models, the new model fuses all three stages into one and processes the raw audio stream directly.
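To make the "latency and errors compound" point concrete, here is a minimal back-of-the-envelope sketch of a cascaded pipeline. The stage latencies and accuracies are invented placeholder numbers, not measurements of any real system, and the model assumes stage errors are independent and never cancel out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical per-stage delay
    accuracy: float    # hypothetical probability the stage output is correct


def cascade_latency_ms(stages: Sequence[Stage]) -> float:
    """Stages run back to back, so their delays add up."""
    return sum(stage.latency_ms for stage in stages)


def cascade_accuracy(stages: Sequence[Stage]) -> float:
    """Under the independence assumption, per-stage accuracies multiply."""
    accuracy = 1.0
    for stage in stages:
        accuracy *= stage.accuracy
    return accuracy


# Placeholder figures chosen only to illustrate the arithmetic.
CASCADE = [
    Stage("stt", latency_ms=300.0, accuracy=0.95),
    Stage("mt", latency_ms=200.0, accuracy=0.95),
    Stage("tts", latency_ms=250.0, accuracy=0.99),
]

if __name__ == "__main__":
    print(f"total latency: {cascade_latency_ms(CASCADE):.0f} ms")
    print(f"end-to-end accuracy: {cascade_accuracy(CASCADE):.3f}")
```

Even with each stage at 95% or better, the end-to-end figure in this toy model is lower than that of any single stage, which is the compounding effect the cascade suffers from.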
Unlike turn-by-turn systems that wait for the speaker to finish before responding, Gemini 3.5 Live Translate generates speech continuously, balancing two competing goals: waiting for more context to improve quality, and translating immediately to stay in sync with the speaker. The result is a translation that stays just a few seconds behind the speaker throughout the session, with no awkward pauses between utterances.
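The wait-versus-translate trade-off can be illustrated with wait-k, a simple fixed-lag policy from the simultaneous-translation literature. This is a stand-in chosen for illustration only: nothing in the source says Gemini 3.5 Live Translate uses wait-k, and the real model works on audio rather than tokens. The `toy_translate` function is a hypothetical placeholder that assumes a one-to-one token mapping.

```python
from typing import Iterable, Iterator, List, Tuple

Action = Tuple[str, str]


def toy_translate(token: str) -> str:
    """Hypothetical stand-in for a real translation step."""
    return token.upper()


def wait_k_schedule(source: Iterable[str], k: int) -> Iterator[Action]:
    """Yield READ/WRITE actions that keep the output k tokens behind the input."""
    if k < 1:
        raise ValueError("k must be at least 1")
    buffer: List[str] = []
    written = 0
    for token in source:
        buffer.append(token)
        yield ("READ", token)
        # Once k tokens of context are available, emit one output per new input.
        if len(buffer) - written >= k:
            yield ("WRITE", toy_translate(buffer[written]))
            written += 1
    # The speaker has stopped: flush whatever is still pending.
    while written < len(buffer):
        yield ("WRITE", toy_translate(buffer[written]))
        written += 1


if __name__ == "__main__":
    for action in wait_k_schedule(["a", "b", "c", "d"], k=2):
        print(action)
```

A small `k` keeps the output close behind the speaker but gives each output less context. A `k` at least as large as the utterance reads everything before writing anything, which is the turn-by-turn behavior.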
What it's built on
Developers can access the model, currently in public preview, via the Gemini Live API and Google AI Studio. According to the official model card, Gemini 3.5 Live Translate is part of the Gemini 3 family of models and is specifically based on Gemini 3 Pro. The model takes raw audio in and produces raw audio out. There is no text intermediary in the core translation path, though you can optionally request text transcripts of both input and output for debugging.
