Speculative decoding is one of the most impactful inference tricks in production LLM serving today. The idea is simple: a small, cheap draft model proposes several tokens ahead, and the large target model verifies them all in a single forward pass. If the guesses are good, you get multiple tokens for roughly the price of one target-model pass. EAGLE achieves 2-3x speedups over standard autoregressive decoding and is widely deployed in production inference frameworks including vLLM, SGLang, and TensorRT-LLM. But there has been a quiet, persistent problem lurking in deeper speculation chains, and the EAGLE team just found it, named it, and fixed it.
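
To make the propose-then-verify loop concrete, here is a minimal sketch of one speculation step. It is a deliberate simplification, not EAGLE's algorithm: it uses greedy exact-match verification rather than sampling-based acceptance, it calls the target once per position where a real system would score every position in one batched forward pass, and the names speculative_step, draft_next, and target_next are hypothetical toy stand-ins rather than any framework's API.

```python
from typing import Callable, List

# A "model" here is just a function from a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, k: int
) -> List[int]:
    """Run one propose-then-verify step and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position in order.
    #    (Assumption: a real system gets all of these predictions from a
    #    single forward pass; the loop here is only for readability.)
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every guess matched, so the target's next prediction is a free bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], draft_next, target_next, k=4))
    print(speculative_step([5], draft_next, target_next, k=3))
```

Each step yields at least one token (the target's correction) and at most k + 1, which is why draft accuracy at deeper positions matters so much for the overall speedup.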

The EAGLE series -- including EAGLE 1, EAGLE 2, and EAGLE 3 -- has become one of the most widely adopted families of speculative decoding algorithms across both research and production systems. Now, the EAGLE team, vLLM team, and TorchSpec team have jointly introduced EAGLE 3.1 -- a major step forward in speculative decoding robustness, efficiency, and deployability.

The Bug That Nobody Had a Name For

While speculative decoding performs well in controlled settings, performance often degrades when the chat template changes, the context gets long, or the system prompt is out of distribution. The EAGLE team traced this fragility to a phenomenon they call attention drift: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.

Sink tokens are positions, typically at the very start of the sequence (like the beginning-of-sequence token), that receive a disproportionate share of attention and that transformers rely on as stable "anchors". When the drafter stops attending to them and starts over-attending to its own recent outputs, it loses its grounding in the original prompt, and its token proposals become increasingly unreliable the further out it speculates.
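
One way to picture attention drift is to track, at each speculation step, how much attention mass lands on the sink positions versus the drafter's own generated positions. The sketch below is an illustrative diagnostic under stated assumptions, not the EAGLE team's measurement code: the function names are hypothetical, the sink is assumed to be the first n_sink positions, and the score rows are synthetic values chosen only to show the pattern.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass(weights: Sequence[float], positions: Iterable[int]) -> float:
    """Total attention weight assigned to the given positions."""
    return sum(weights[i] for i in positions)


def drift_profile(
    score_rows: Sequence[Sequence[float]], n_sink: int, prompt_len: int
) -> List[Tuple[float, float]]:
    """Return (sink mass, self-generated mass) for each speculation step.

    Each row holds the scores of the newest drafted token over all earlier
    positions: prompt_len prompt tokens followed by drafted tokens.
    """
    profile = []
    for scores in score_rows:
        w = softmax(scores)
        sink = attention_mass(w, range(n_sink))
        own = attention_mass(w, range(prompt_len, len(w)))
        profile.append((sink, own))
    return profile


# Synthetic scores (NOT measured data): 4 prompt tokens, position 0 is the sink,
# and one more drafted token becomes visible at each step.
toy_rows = [
    [3.0, 0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0],
]

if __name__ == "__main__":
    for step, (sink, own) in enumerate(drift_profile(toy_rows, 1, 4), start=1):
        print(f"step {step}: sink={sink:.2f} self-generated={own:.2f}")
```

In this toy setup the sink share falls and the self-generated share rises with depth, which is the signature the article describes; on a real drafter the rows would come from its actual attention scores.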
