Google's Gemini 3.5 Live Translate Kills the 3-Step Pipeline for $0.023 a Minute

Google AI Developers

7D AGO

2 min read

AUDIO

realtime_voice speech_to_text text_to_speech

API

Real-time speech translation has been a solved problem on paper for years. In practice, it has always felt like a game of telephone: your voice gets transcribed to text, that text gets machine-translated, and the result gets synthesized back into speech. Every handoff adds latency and a new surface for errors to compound. Gemini 3.5 Live Translate collapses that three-step cascade into a single end-to-end audio model, and the difference is audible.

The pipeline problem it actually solves

Older real-time translation systems, including Google Meet's previous implementation, ran a cascaded three-step pipeline: transcribe speech to text (STT), translate that text, then synthesize the translated text back to speech (TTS). Each hop adds latency and a place for errors to compound -- a mistranscription becomes a mistranslation becomes a confidently wrong spoken sentence.
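To make the compounding concrete, here is a minimal sketch of how latency and error add up across a cascade. The per-stage latencies and accuracies are hypothetical placeholders, not measurements of any Google system, and the model assumes each stage fails independently, which real pipelines only approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical per-stage delay
    accuracy: float    # hypothetical probability this stage's output is correct


def cascade_totals(stages):
    """Return (total latency, end-to-end accuracy) for a sequential pipeline.

    Latencies add because each stage waits on the previous one.
    Accuracies multiply under the simplifying assumption that stages
    fail independently and an upstream error is never repaired downstream.
    """
    total_latency = sum(s.latency_ms for s in stages)
    end_to_end_accuracy = 1.0
    for s in stages:
        end_to_end_accuracy *= s.accuracy
    return total_latency, end_to_end_accuracy


# Illustrative numbers only.
PIPELINE = [
    Stage("speech-to-text", 300.0, 0.95),
    Stage("translation", 200.0, 0.95),
    Stage("text-to-speech", 250.0, 0.99),
]

latency_ms, accuracy = cascade_totals(PIPELINE)
print(f"{latency_ms:.0f} ms, {accuracy:.3f}")  # prints "750 ms, 0.893"
```

Even with every stage at 95% or better, the chain as a whole lands below 90% in this toy model, which is the "game of telephone" effect in numbers.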

Three architectural choices distinguish Gemini 3.5 Live Translate from earlier streaming-translation systems. The most visible is fusion: instead of chaining a streaming speech-to-text model, a machine-translation model, and a separate text-to-speech model, the new model handles all three stages in one and processes the raw audio stream directly.

Unlike turn-by-turn systems that wait for the speaker to finish before responding, Gemini 3.5 Live Translate generates speech continuously. It trades off waiting for more context, which improves quality, against translating immediately, which keeps it in sync with the speaker. The result is a translation that stays just a few seconds behind the speaker throughout the session, with no awkward pauses between utterances.
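The wait-versus-translate trade-off can be illustrated with a wait-k schedule, a well-known policy from text-based simultaneous translation research. This is a swapped-in illustration, not a description of how Gemini 3.5 Live Translate decides when to speak: the function name and the token counts below are hypothetical, and the real model works on audio rather than discrete tokens.

```python
def wait_k_schedule(num_source: int, num_target: int, k: int) -> list[int]:
    """For each target token, return how many source tokens have been read
    before it is emitted under a wait-k policy.

    The translator first waits for k source tokens, then emits one target
    token per additional source token, and finishes once the source ends.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + i, num_source) for i in range(num_target)]


# Small k: low lag, little context. Large k: more context, more lag.
for k in (1, 3, 6):
    print(k, wait_k_schedule(num_source=6, num_target=6, k=k))
# 1 [1, 2, 3, 4, 5, 6]
# 3 [3, 4, 5, 6, 6, 6]
# 6 [6, 6, 6, 6, 6, 6]
```

When k reaches the length of the utterance, the schedule degenerates into the turn-by-turn behavior described above: nothing is emitted until the speaker has finished.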

What it's built on

Developers can access the model via the Gemini Live API and Google AI Studio, currently in public preview. According to the official model card, Gemini 3.5 Live Translate is part of the Gemini 3 family of models, and is specifically based on Gemini 3 Pro. The model takes raw audio in and outputs raw audio -- there is no text intermediary in the core translation path, though you can optionally request text transcripts of both input and output for debugging.
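The headline figure of $0.023 per minute makes rough session budgeting straightforward. The sketch below assumes billing is strictly linear in minutes with no rounding, minimums, or separate input and output charges; the article does not spell out those details, so treat the result as an estimate and check the official pricing page before relying on it.

```python
from decimal import Decimal

# Per-minute price taken from the headline; billing granularity is an assumption.
PRICE_PER_MINUTE_USD = Decimal("0.023")


def session_cost_usd(minutes) -> Decimal:
    """Estimate the cost of a translation session of the given length."""
    duration = Decimal(str(minutes))
    if duration < 0:
        raise ValueError("minutes must be non-negative")
    return duration * PRICE_PER_MINUTE_USD


print(session_cost_usd(60))       # one-hour meeting: 1.380
print(session_cost_usd(8 * 60))   # eight-hour event: 11.040
```

Using `Decimal` rather than floats keeps the arithmetic exact, which matters once per-minute costs are summed over many sessions.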
