Google Ships Gemini 3.5 Live Translate to Kill the 3-Model Pipeline

Google for Developers

7D AGO

2 min read

AUDIO

realtime_voice speech_to_text

API

7 days ago

AUDIO

realtime_voice speech_to_text

API

2 min read

Google just shipped Gemini 3.5 Live Translate, a dedicated audio model that translates spoken language into spoken language as the audio streams in. No waiting for a sentence to finish. No discrete request-response cycle. Just continuous speech in, continuous translated speech out, staying a few seconds behind the speaker throughout the session.

The pipeline problem it kills

Every real-time translation system before this worked roughly the same way: capture speech, run it through a speech-to-text model, feed the transcript to a translation model, then synthesize the result with a text-to-speech engine. Three separate models, three separate latency budgets, three places for errors to compound.

Three architectural choices distinguish Gemini 3.5 Live Translate from earlier streaming-translation systems. Traditional pipelines run audio through a streaming speech-to-text model, feed the transcript to a machine-translation model, then synthesize the translation through a separate text-to-speech model. Each stage adds latency and accumulates errors. Gemini 3.5 Live Translate folds these steps into one audio model. The trade-off is real though: the output is permanent audio, not editable text , once a word is spoken, it cannot be revised mid-utterance.

Gemini is a natively multimodal model, meaning it was trained on text, images, code, and audio simultaneously rather than being a language model with audio features bolted on. This matters for translation because speech carries information that text doesn't , emphasis, hesitation, pace, and tone. Gemini 3.5 Live Translate is based on Gemini 3 Pro.

What it actually does

The model automatically detects 70+ languages and generates smooth, natural-sounding translated speech that preserves the speakers' intonation, pacing and pitch. Unlike turn-by-turn systems that wait for the speaker to finish speaking before responding, 3.5 Live Translate generates speech continuously, balancing the trade-off between waiting for context to improve quality and translating immediately to stay in sync with the speaker. It delivers fluid audio without awkward pauses and stays just a few seconds behind the speaker throughout the session.

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

Full access to in-depth AI research breakdowns
Be the first to know what's trending before it hits mainstream
Daily curated papers, repos, and industry moves

Google Ships Gemini 3.5 Live Translate to Kill the 3-Model Pipeline

Takeaways

The pipeline problem it kills

What it actually does

Don't miss what's next in AI