Epoch Expands its ECI to 65 Benchmarks as Classic AI Evals Die

Epoch AI

18H AGO

2 min read

BENCHMARKS

LLMS

long_context mixture_of_experts vision_language

Epoch AI just expanded its AI Capabilities Benchmarking Hub with 13 new evaluations, bringing the total to 65 distinct benchmarks tracked across mathematics, coding, agentic tasks, science, games, and more. Seven of those new evals have been incorporated directly into the Epoch Capabilities Index (ECI), the composite score Epoch uses to track frontier AI progress over time.

The benchmark graveyard problem

The expansion isn't just housekeeping. It's a direct response to a growing crisis in AI evaluation: classic benchmarks are dying. Top frontier models now cluster above 89% on MMLU-Pro, a concentration that signals the benchmark's diminishing utility as a differentiator. When the top models all score within a point of one another, the eval tells you nothing useful about which model to pick for a real task.

This is the core problem the ECI was designed to solve. ECI "stitches" benchmarks together to enable comparisons even as individual benchmarks become saturated, and allows models to be compared even if they were never evaluated on the same benchmarks. Adding harder, more diverse evals keeps the composite score meaningful as the frontier pushes forward.
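To make the "stitching" idea concrete, here is a minimal sketch of how a shared scale can be recovered from a sparse model-by-benchmark score table. It is an illustration under stated assumptions, not Epoch's actual method: it assumes each score's log-odds equals a model capability minus a benchmark difficulty, and fits both by alternating averages. The function names (`logit`, `stitch`), the model and benchmark labels, and the scores are all hypothetical.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of an accuracy, clipped so 0% and 100% stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def stitch(scores, iters=200):
    """Fit logit(score) ~ capability[model] - difficulty[benchmark].

    scores maps (model, benchmark) -> accuracy in [0, 1]; pairs may be missing.
    Assumes the models and benchmarks are linked through shared evaluations.
    Returns (capability, difficulty) dicts on a common scale, shifted so the
    weakest model sits at 0.
    """
    y = {pair: logit(acc) for pair, acc in scores.items()}
    models = sorted({m for m, _ in y})
    benches = sorted({b for _, b in y})
    cap = {m: 0.0 for m in models}
    diff = {b: 0.0 for b in benches}
    for _ in range(iters):
        # Least-squares update of each capability with difficulties held fixed.
        for m in models:
            vals = [y[(m, b)] + diff[b] for b in benches if (m, b) in y]
            cap[m] = sum(vals) / len(vals)
        # Least-squares update of each difficulty with capabilities held fixed.
        for b in benches:
            vals = [cap[m] - y[(m, b)] for m in models if (m, b) in y]
            diff[b] = sum(vals) / len(vals)
    base = min(cap.values())  # only differences are identified, so pin the scale
    return ({m: c - base for m, c in cap.items()},
            {b: d - base for b, d in diff.items()})

# Made-up scores: A and C post the same raw number but never share a benchmark.
scores = {
    ("A", "easy"): 0.731,
    ("B", "easy"): 0.881,
    ("B", "hard"): 0.500,
    ("C", "hard"): 0.731,
}
cap, diff = stitch(scores)
for name in sorted(cap):
    print(name, round(cap[name], 2))
```

In this toy table, models A and C both score 73.1%, but on different benchmarks. Because model B was evaluated on both, the fit can place "hard" well above "easy" in difficulty and rank C roughly two logits above A, even though the two were never tested head to head.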

What's new

The nine external benchmarks added in the latest wave span agentic work, cybersecurity, algorithm engineering, forecasting, and research-level physics. Looking at the full hub, the newly added evals include some genuinely novel test designs:

ALE-Bench -- evaluates AI on long-horizon, objective-driven algorithm engineering using hard combinatorial optimization problems from competitive programming contests
GBAEval -- a long-horizon software engineering benchmark that tasks coding agents with implementing a Game Boy Advance emulator from scratch
