Liquid AI's LFM2.5 Beats a 7.7B Voice Model at Just 1.5B Parameters

Liquid AI

Jun 06, 2026

2 min read

AUDIO

speech_to_text text_to_speech

LLMS

small_models

Liquid AI just dropped two new open-weight models targeting the Japanese-language market, and the headline number is hard to ignore: a 1.5B-parameter audio model that beats a 7.7B competitor on conversational benchmarks. The two releases are LFM2.5-Audio-1.5B-JP, the company's first Japanese speech-to-speech model, and LFM2.5-1.2B-JP-202606, an updated Japanese text model. Both are available now on Hugging Face.

One model, no pipeline glue

The audio model is the more technically interesting of the two. Most production voice systems are stitched together from three separate components: a speech recognizer (ASR) to transcribe the user, a language model to generate a response, and a text-to-speech engine (TTS) to speak it back. That pipeline adds latency at every seam and creates failure modes at each handoff.
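To make the latency argument concrete, here is a minimal sketch of how time-to-first-audio accumulates in a cascaded pipeline. The stage latencies and the per-handoff overhead are hypothetical numbers chosen for illustration, not measurements of any real system, and the `Stage` and `time_to_first_audio` names are invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    latency_ms: int  # hypothetical time before this stage emits usable output


def time_to_first_audio(stages: List[Stage], handoff_ms: int = 0) -> int:
    """Sum stage latencies plus a fixed cost for each handoff between stages."""
    handoffs = max(len(stages) - 1, 0)
    return sum(s.latency_ms for s in stages) + handoff_ms * handoffs


# Illustrative values only: three separate components with two seams between them.
cascaded = [Stage("asr", 200), Stage("llm", 300), Stage("tts", 150)]
# A single end-to-end model has one stage and therefore no handoffs.
end_to_end = [Stage("speech_lm", 400)]

cascaded_ms = time_to_first_audio(cascaded, handoff_ms=20)
end_to_end_ms = time_to_first_audio(end_to_end, handoff_ms=20)
print(cascaded_ms, end_to_end_ms)
```

The point of the sketch is structural: a cascaded system pays for every stage and every seam in sequence, while a single model has no seams to pay for. Whether the single stage is actually faster depends on the model and hardware.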

LFM2.5-Audio-1.5B-JP is an end-to-end multimodal speech and text language model that does not require separate ASR and TTS components. Designed for low latency and real-time conversation, it handles Japanese spoken dialogue in a single model with only 1.5 billion parameters.

The model consists of three parts: a pretrained LFM2.5 backbone, a FastConformer-based audio encoder that handles continuous audio input, and an RQ-transformer that generates discrete audio tokens, paired with a lightweight audio detokenizer that turns those tokens into output audio. The FastConformer encoder (115M parameters) is based on NVIDIA's Canary checkpoint, and audio output uses Kyutai's Mimi codec with 8 codebooks at 24 kHz.
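A quick back-of-the-envelope calculation shows what the codec configuration implies for the output token budget. The 8 codebooks and the 24 kHz sample rate come from the article; the 12.5 Hz frame rate and the 2048-entry codebook size are assumptions based on Kyutai's published description of Mimi and are not stated here, so treat the derived numbers as approximate.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the article
NUM_CODEBOOKS = 8        # from the article
FRAME_RATE_HZ = 12.5     # assumption: Mimi's frame rate, not stated in the article
CODEBOOK_SIZE = 2048     # assumption: entries per codebook, not stated in the article

# Raw audio samples covered by one codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

# Discrete audio tokens needed per second of speech (one token per codebook per frame).
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS

# Bits carried by each token, and the resulting codec bitrate.
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_bps = tokens_per_second * bits_per_token

print(samples_per_frame, tokens_per_second, bitrate_bps)
```

Under these assumptions, one second of generated speech costs about 100 audio tokens, which is why the token generator and detokenizer need to be lightweight for real-time use.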

Figure: Benchmark comparison of LFM2.5-1.2B-JP-202606 against sub-2B Japanese models across knowledge, instruction following, math, code, and tool use tasks.

Two generation modes for different tasks

The audio model supports two distinct generation routines. Interleaved generation enables real-time speech-to-speech conversational chatbot capabilities where audio generation latency is key. Sequential generation is suited for non-conversational tasks such as ASR or TTS, and allows the model to switch generated modality on the fly.
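The difference between the two routines can be illustrated with a toy token schedule. This is a sketch of the general idea only: the function names, the `"T"`/`"A"` labels, and the chunk sizes are hypothetical and do not reflect the model's actual interleaving pattern or API.

```python
from typing import List


def interleaved_schedule(n_text: int, n_audio: int,
                         text_chunk: int = 2, audio_chunk: int = 4) -> List[str]:
    """Alternate short runs of text ("T") and audio ("A") tokens.

    Chunk sizes are illustrative, not the model's real values.
    """
    order: List[str] = []
    t = a = 0
    while t < n_text or a < n_audio:
        take_t = min(text_chunk, n_text - t)
        order += ["T"] * take_t
        t += take_t
        take_a = min(audio_chunk, n_audio - a)
        order += ["A"] * take_a
        a += take_a
    return order


def sequential_schedule(n_text: int, n_audio: int) -> List[str]:
    """Emit one modality in full, then the other (here: all text, then all audio)."""
    return ["T"] * n_text + ["A"] * n_audio


def first_audio_index(order: List[str]) -> int:
    """Number of tokens generated before the first audio token appears."""
    return order.index("A")


inter = interleaved_schedule(6, 12)
seq = sequential_schedule(6, 12)
print(first_audio_index(inter), first_audio_index(seq))
```

In the interleaved schedule the first audio token arrives after only a short run of text, which is what matters for conversational latency. The sequential schedule delays audio until the text is complete, which is acceptable for one-shot tasks such as transcription or speech synthesis.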
