Ai2's OLMo Hybrid Beats Transformers on Meaning While Using 49% Fewer Tokens

Ai2

13H AGO

2 min read

LLMS

long_context mixture_of_experts

BENCHMARKS

The transformer has ruled language modeling for years, but a new class of architectures is mounting a serious challenge: hybrid models that mix traditional attention layers with linear recurrent layers (modern RNNs). The question is no longer whether hybrids can match transformers on leaderboards (they can), but why, and on which specific tasks. Ai2 just published a technical report that answers this at the finest possible granularity: the individual token level.

The field is already moving

Hybrid language models, architectures that mix transformer attention with linear recurrent layers, have been gaining momentum across the field, with recent efforts including Samba, Nemotron-H, Qwen3-Next, Kimi Linear, and Qwen 3.5. These models have been trained at scales up to 9B active parameters and 36T tokens with encouraging results. Yet a fundamental question has gone unanswered: what, exactly, does each architectural component contribute to the model's predictions?

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks and hybrid models that mix recurrence and attention, yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. Ai2's new study is a direct attempt to resolve that.

A controlled experiment, token by token

The key to this study's credibility is the experimental setup. Ai2 compared OLMo 3 (a pure transformer) and OLMo Hybrid (which swaps 75% of attention layers for Gated DeltaNet recurrent layers) in a head-to-head evaluation. The hybrid uses a 3:1 hybridization ratio, three recurrent layers for every attention layer, replacing the sliding-window attention layers from OLMo 3 with Gated DeltaNet layers. Because both models were built to be as alike as possible outside their architectures (matched on data, tokenizer, and training recipe), any difference in their predictions mostly reflects the architecture itself.
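To make the setup concrete, here is a minimal Python sketch of two ideas from the paragraph above: a 3:1 layer layout and a per-token loss comparison between two matched models. It is an illustration under stated assumptions, not Ai2's code. The function names, the 8-layer depth, the choice to place the attention layer last in each group of four, and the toy probabilities are all hypothetical; the probabilities are made-up inputs, not results from the report.

```python
import math


def hybrid_layer_pattern(n_layers, recurrent_per_attention=3):
    """Lay out layer types at a 3:1 recurrent-to-attention ratio.

    Assumption: the attention layer comes last in each group of four.
    The article states the ratio, not the position within the group.
    """
    period = recurrent_per_attention + 1
    return [
        "attention" if (i + 1) % period == 0 else "gated_deltanet"
        for i in range(n_layers)
    ]


def per_token_loss_gap(p_transformer, p_hybrid):
    """Per-token NLL(transformer) minus NLL(hybrid).

    Inputs are the probabilities each model assigned to the true token.
    A positive gap means the hybrid predicted that token better.
    """
    return [
        -math.log(pt) + math.log(ph)
        for pt, ph in zip(p_transformer, p_hybrid)
    ]


# Hypothetical 8-layer stack: 6 of 8 layers (75%) are recurrent.
layers = hybrid_layer_pattern(8)
recurrent_share = layers.count("gated_deltanet") / len(layers)
print(layers)
print(f"recurrent share: {recurrent_share:.0%}")

# Toy probabilities for three tokens (illustrative only, not measured data).
tokens = ["The", "cat", "sat"]
gaps = per_token_loss_gap([0.5, 0.2, 0.1], [0.5, 0.4, 0.05])
for token, gap in zip(tokens, gaps):
    print(f"{token!r}: loss gap = {gap:+.3f}")
```

Because the two models share data, tokenizer, and training recipe, a per-token gap like this can be attributed mostly to architecture, which is what makes a token-level comparison meaningful.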
