Epoch's EBR-Bench Shows GPT-5.5 and Claude Can't Learn From Practice

Epoch AI

2H AGO

2 min read

BENCHMARKS

REASONING

math_reasoning test_time_compute

EBR-bench, a new benchmark from Epoch AI, asks a deceptively simple question: can today's best AI models actually get better at something through practice? After having GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro play a complex board game up to 30 times in a row, the answer is a clear no. The models show no meaningful improvement across repeated playthroughs -- a finding with real implications for how we think about AI autonomy and safety.

Why this question matters right now

An AI system that could pick up unfamiliar tasks on the fly would be far more capable than the systems we're used to. Even if it performed poorly out of the box on some economically relevant task, it could still learn "on the job." That's the economic upside. The safety concern is the flip side: it would be harder to determine before release whether a model had dangerous capabilities, since it could acquire them later through learning.

This is one of the most actively debated open questions in AI capabilities research right now. Inference-time compute scaling -- the idea of spending more compute at runtime to get better results -- has shown real promise on reasoning tasks. But EBR-bench is asking something different: not "can you think harder on one problem?" but "can you accumulate knowledge across many attempts at an unfamiliar task?"
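
To make "accumulating knowledge across many attempts" concrete, one rough way to check for it is to ask whether scores trend upward as the attempt number increases. The sketch below fits a least-squares line to a sequence of per-playthrough scores and returns its slope. It is an illustration only, not Epoch's methodology: the function name `improvement_slope` and both score lists are hypothetical, and the numbers are made up rather than taken from EBR-bench.

```python
from statistics import mean


def improvement_slope(scores):
    """Least-squares slope of score against attempt index (0, 1, 2, ...).

    A slope near zero suggests no trend across attempts; a clearly positive
    slope suggests the player is improving with practice.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two attempts to estimate a trend")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(scores)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical, made-up scores for illustration only (not benchmark results).
flat_player = [10, 12, 9, 11, 10, 11]        # noisy, but no upward trend
improving_player = [10, 12, 14, 16, 18, 20]  # gains 2 points per attempt

print(improvement_slope(flat_player))
print(improvement_slope(improving_player))
```

A slope alone is a crude signal: with noisy scores and at most 30 playthroughs, a real analysis would also need some estimate of uncertainty before calling a trend meaningful.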

The game as a test bed

Earthborne Rangers (EBR) is a campaign-style game where a player explores a wilderness landscape, overcoming obstacles and pursuing objectives. It's relatively obscure, almost entirely card-based with very little spatial reasoning, and requires a mix of strategy and tactics around deck-building and turn-by-turn play. A single playthrough of the segment used in the benchmark takes humans 2 to 4 hours, and mastery can require dozens of playthroughs.
