Anthropic's Claude Fable 5 Beats the Hardest Coding Benchmark in Under 2 Hours

Mechanize

7H AGO

2 min read

Anthropic's Claude Fable 5 just set a new record on one of the most demanding coding benchmarks in existence: writing a Game Boy Advance emulator from scratch. Mechanize, the AI evaluation company, ran Fable 5 through its GBA Eval, and the results are striking, both for what the model got right and for one surprising regression.

The hardest coding test you've never heard of

GBA Eval is not a typical benchmark. Models are tasked with writing, from scratch, a Game Boy Advance emulator in Rust that compiles to WebAssembly. That alone is a multi-week project for a skilled human engineer. The grading is equally rigorous: accuracy is measured using a combination of existing open-source test suites and test cases of gameplay patterns in real ROMs, evaluated using a custom harness.

What makes this benchmark fair, and this kind of grading tractable, is the GBA's hardware determinism. The console itself has no entropy source: there is no RTC, wall clock, or analog input. That means the emulator's output on any given input sequence is perfectly reproducible, so Mechanize can compare the model's emulator frame by frame against Mesen2, a reference-quality open-source GBA emulator.
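To make the frame-by-frame idea concrete, here is a minimal sketch in Rust of how two deterministic frame sequences could be compared. It is not Mechanize's harness, whose details are not public in this article. The `Frame` layout, `first_divergence`, and `matching_fraction` are hypothetical names chosen for illustration; only the 240x160 screen size is a hardware fact.

```rust
/// GBA LCD resolution: 240 x 160 pixels.
pub const WIDTH: usize = 240;
pub const HEIGHT: usize = 160;

/// One rendered frame, one 15-bit color value per pixel.
/// (Assumed layout for this sketch, not taken from the benchmark's ABI.)
pub type Frame = Vec<u16>;

/// Index of the first frame at which `candidate` differs from `reference`.
/// A length mismatch counts as a divergence at the end of the shorter run.
/// Returns `None` when the two sequences are identical.
pub fn first_divergence(reference: &[Frame], candidate: &[Frame]) -> Option<usize> {
    let shared = reference.len().min(candidate.len());
    for i in 0..shared {
        if reference[i] != candidate[i] {
            return Some(i);
        }
    }
    if reference.len() != candidate.len() {
        Some(shared)
    } else {
        None
    }
}

/// Fraction of reference frames that the candidate reproduces exactly.
/// Frames missing from the candidate count as mismatches.
pub fn matching_fraction(reference: &[Frame], candidate: &[Frame]) -> f64 {
    if reference.is_empty() {
        return 0.0;
    }
    let matches = reference
        .iter()
        .zip(candidate.iter())
        .filter(|(r, c)| r == c)
        .count();
    matches as f64 / reference.len() as f64
}
```

Exact equality is only a sensible test because the hardware is deterministic: with the same ROM and the same input sequence, any differing pixel indicates an emulation bug rather than noise.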

The setup is specific: each model gets a Docker container with the Rust and wasm32 toolchain, the ABI specification, a BIOS stub, dev ROMs, and an oracle CLI, a black-box wrapper around Mesen2 that the model can run on any ROM with any input sequence to see Mesen2's behavior. The model cannot read Mesen2's source and does not have access to the internet. Each model runs for 24 hours, with checkpoints taken every 15 minutes.

74.5% and a two-hour knockout

Claude Fable 5 scored 74.5% on GBA Eval, the highest score any model has achieved. But the more telling number is the speed: it beat Opus 4.8's 24-hour score in under 2 hours. For context, Claude Opus 4.8 scored 70.9% on GBA Eval. Given 24 hours, it writes an emulator that plays most games with working audio on all of them, and it beat the previous best (GPT-5.5 at 53.2%) in under an hour. Fable 5 has now leapfrogged that bar entirely.
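A "time to beat" figure like this can be read off the 15-minute checkpoints. The Rust sketch below shows one way to compute it, assuming the first checkpoint is taken 15 minutes into the run. That timing, the function names, and any per-checkpoint scores you feed it are illustrative assumptions, not Mechanize's published data or method.

```rust
/// Checkpoint interval stated in the article, in minutes.
pub const CHECKPOINT_MINUTES: u32 = 15;
/// Run length stated in the article, in hours.
pub const RUN_HOURS: u32 = 24;

/// Number of checkpoints in a full run: 24 h * 60 min / 15 min.
pub fn checkpoints_per_run() -> u32 {
    RUN_HOURS * 60 / CHECKPOINT_MINUTES
}

/// Minutes elapsed at the first checkpoint whose score exceeds `bar`.
/// `scores[i]` is the score at checkpoint i (0-based), assumed to be taken
/// at (i + 1) * 15 minutes. Returns `None` if the bar is never beaten.
pub fn minutes_to_beat(scores: &[f64], bar: f64) -> Option<u32> {
    scores
        .iter()
        .position(|&s| s > bar)
        .map(|i| (i as u32 + 1) * CHECKPOINT_MINUTES)
}
```

A full run yields 96 checkpoints, so "under 2 hours" means the 70.9% bar was already cleared somewhere within the first eight of them.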
