Epoch's MirrorCode Benchmark Shows Claude Opus 4.7 Rebuilding Software Autonomously in 14 Hours

Epoch AI

3H AGO

3 min read

How long can an AI model code on its own before it needs a human to step in? That question has been frustratingly hard to answer with existing benchmarks, which mostly test bug fixes and small feature additions. MirrorCode, a new benchmark from Epoch AI co-developed with METR, takes a radically different approach: give an AI model a compiled binary it cannot read, documentation, and a test suite, then ask it to rebuild the entire program from scratch.

The results are striking. Claude Opus 4.7 achieved a headline score of 56% across 25 target programs, and on one task, reimplementing gotree (a bioinformatics toolkit with 16,000 lines of Go and 40+ commands), it passed 99.95% of tests. The team estimates that same task would take a human engineer 2 to 17 weeks. Opus 4.7 finished in 14 hours for $251.

Why existing SWE benchmarks fall short

AI models are increasingly capable of autonomous coding, and several notable software engineering benchmarks have seen rapid progress. However, these usually measure fairly short coding tasks; for example, only about 100 of the 731 SWE-bench Pro tasks involve diffs larger than 100 lines. Meanwhile, recent demos of AI coding, such as developing a new C compiler or a new browser, are impressive but hard to evaluate. The completeness of the resulting software is debatable, and the extent of human guidance is unclear, which makes it difficult to use these demos as a proxy for autonomous AI coding.

There is also a deeper structural problem. Current benchmarks fall short on two dimensions: horizon and verifier strength. The dominant public benchmarks measure agent performance on minute-scale tasks, and even on some of the most challenging ones, top agents resolve most tasks within an hour. Spending $5 on a task that would take a human weeks is not a fair test of AI's ceiling.

The MirrorCode setup

MirrorCode addresses these problems by constructing a long-horizon coding benchmark from existing software projects. Each task consists of a command-line program that an agent must reimplement exactly. The agent is given execute-only access to the original program and a set of visible test cases, but no access to the original source code. Think of it as: here is the black box, here is the manual, now build the same thing.
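To make the "reimplement exactly" idea concrete, here is a minimal sketch of differential testing against a black-box reference program. It is not MirrorCode's actual harness, which the article does not describe. It assumes that a test passes when the candidate's exit code and standard output match the reference's, and the names TestCase, run, and pass_rate are hypothetical. Small Python one-liners stand in for the original binary and the agent's reimplementation.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TestCase:
    args: list[str]   # command-line arguments passed to both programs
    stdin: str = ""   # text piped to standard input


def run(cmd: list[str], case: TestCase, timeout: float = 10.0) -> tuple[int, str]:
    """Run one program on one test case and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[TestCase]) -> float:
    """Fraction of cases where the candidate matches the reference exactly."""
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Assumption: these one-liners are stand-ins for real executables.
    # The reference is only ever executed, never inspected.
    reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    # The candidate has a deliberate bug: it strips trailing '!' characters.
    candidate = [sys.executable, "-c", "import sys; print(sys.argv[1].upper().rstrip('!'))"]
    cases = [TestCase(["abc"]), TestCase(["hello"]), TestCase(["hi!"]), TestCase(["x y"])]
    print(f"pass rate: {pass_rate(reference, candidate, cases):.2%}")
```

With these stand-ins, the candidate disagrees with the reference only on the "hi!" case, so the script reports a pass rate of 75.00%. A real harness would likely also compare standard error and any files written, and would isolate each run in a sandbox.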

The benchmark spans a deliberately broad set of domains:

Unix utilities (e.g., cal, choose)
Bioinformatics toolkits (e.g., gotree)
Interpreters and configuration languages (e.g., Pkl)
Data serialization and query tools
Static analysis, cryptography, and compression
