Epoch AI just expanded its AI Capabilities Benchmarking Hub with 13 new evaluations, bringing the total to 65 distinct benchmarks tracked across mathematics, coding, agentic tasks, science, games, and more. Seven of those new evals have been incorporated directly into the Epoch Capabilities Index (ECI), the composite score Epoch uses to track frontier AI progress over time.

The benchmark graveyard problem

The expansion isn't just housekeeping. It's a direct response to a growing crisis in AI evaluation: classic benchmarks are dying. Top frontier models now cluster above 89% on MMLU-Pro, a concentration that signals the benchmark's diminishing utility as a differentiator. When every top model scores within a point of each other, the eval tells you nothing useful about which model to pick for a real task.

This is the core problem the ECI was designed to solve. ECI "stitches" benchmarks together to enable comparisons even as individual benchmarks become saturated, and allows models to be compared even if they were never evaluated on the same benchmarks. Adding harder, more diverse evals keeps the composite score meaningful as the frontier pushes forward.

What's new

The nine external benchmarks added in the latest wave span agentic work, cybersecurity, algorithm engineering, forecasting, and research-level physics. Looking at the full hub, the newly added evals include some genuinely novel test designs:

  • ALE-Bench -- evaluates AI on long-horizon, objective-driven algorithm engineering using hard combinatorial optimization problems from competitive programming contests
  • GBAEval -- a long-horizon software engineering benchmark that tasks coding agents with implementing a Game Boy Advance emulator from scratch
Alpha Signal

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

  • Full access to in-depth AI research breakdowns
  • Be the first to know what's trending before it hits mainstream
  • Daily curated papers, repos, and industry moves