Intelligent Internet's Zenith Jumps GPT-5.5 From 5th to 1st on Hardest Coding Benchmark

Intelligent Internet

8H AGO

2 min read

OPEN_SOURCE

POST_TRAINING

distillation dpo fine_tuning

The conventional wisdom in AI agent work is simple: when your agent gets stuck, upgrade the model. Intelligent Internet's Zenith just ran a controlled experiment that challenges that assumption directly. By wrapping GPT-5.5 in a purpose-built agent harness, they moved the same model from 5th place to 1st on FrontierSWE, the hardest public long-horizon software engineering benchmark, beating Claude Fable in the process. The harness is now open source.

The benchmark that breaks every agent

FrontierSWE, built by Proximal AI, is not a typical coding benchmark. It contains ultra-long-horizon, open-ended technical challenges such as optimizing compilers or training state-of-the-art models for protein prediction. Agents are given 20 hours per task; despite this, most models barely make progress on any task, making FrontierSWE one of the few unsaturated public benchmarks. On average, agents run for 11 hours per task and fail to solve almost all of them.

The benchmark exposes a specific failure mode that Intelligent Internet calls premature completion: agents don't give up, they declare victory too early. The tests they write for themselves are superficial enough to make a wrong solution look right. The agent submits, the independent test suite fails, and nobody catches it.
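To make the failure mode concrete, here is a minimal Python sketch of a control loop that refuses to accept an agent's "done" claim until a separate check passes. This is an illustration only, not Zenith's actual design, which the article does not describe. The names `Attempt`, `run_with_gate`, `toy_step`, and `toy_verify` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    solution: str
    claims_done: bool  # the agent's own belief that it has finished


def run_with_gate(
    step: Callable[[str], Attempt],
    verify: Callable[[str], Tuple[bool, str]],
    max_steps: int,
) -> Tuple[Optional[str], int]:
    """Loop until the agent claims completion AND an independent check agrees.

    `step` and `verify` are hypothetical callbacks: `step` takes the latest
    verifier feedback and returns a new attempt; `verify` returns
    (passed, report) for a candidate solution.
    """
    feedback = ""
    for n in range(1, max_steps + 1):
        attempt = step(feedback)
        if not attempt.claims_done:
            feedback = ""
            continue
        passed, report = verify(attempt.solution)
        if passed:
            return attempt.solution, n
        # Premature completion: reject the claim and feed the failure back.
        feedback = report
    return None, max_steps


# Toy agent (assumption): declares victory immediately with a buggy solution,
# and only fixes it once the verifier report points at the bug.
def toy_step(feedback: str) -> Attempt:
    if "off by one" in feedback:
        return Attempt(solution="fixed", claims_done=True)
    return Attempt(solution="buggy", claims_done=True)


def toy_verify(solution: str) -> Tuple[bool, str]:
    if solution == "fixed":
        return True, ""
    return False, "off by one in boundary case"


result, steps_used = run_with_gate(toy_step, toy_verify, max_steps=5)
```

In this toy run the first "done" claim is rejected and the loop ends on the second step. Without the gate, the buggy solution would have been submitted as finished.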

Fifth to first, same model

Intelligent Internet ran the full 17-task FrontierSWE suite with Zenith wrapping GPT-5.5. The results are stark: by mean@5 (average rank across five runs), Zenith scored 2.06 average rank with 92% dominance, landing first. The same GPT-5.5 model on its default Codex harness scored 5.53, sitting fifth. The benchmark, the trial budget, and the base model were identical. Only the control loop changed.
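For readers unfamiliar with rank-based leaderboards, the sketch below shows one plausible way an average rank could be computed from per-task scores. The article does not spell out FrontierSWE's exact scoring procedure or how "dominance" is defined, so the tie handling, the function names, and the toy numbers here are all assumptions for illustration.

```python
from statistics import mean


def ranks_for_task(scores):
    """Rank entries on one task: 1 is best. Tied entries share the mean of
    the positions they occupy (an assumed tie rule)."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of entries tied with ordered[i].
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_rank(per_task_scores):
    """Average each entry's per-task rank over all tasks."""
    per_task_ranks = [ranks_for_task(s) for s in per_task_scores.values()]
    entries = per_task_ranks[0].keys()
    return {e: mean(r[e] for r in per_task_ranks) for e in entries}


# Made-up scores for three hypothetical entries on two hypothetical tasks.
# Each score could itself be a mean over five trials (assumption).
toy_scores = {
    "task_a": {"agent_x": 0.9, "agent_y": 0.5, "agent_z": 0.5},
    "task_b": {"agent_x": 0.2, "agent_y": 0.8, "agent_z": 0.1},
}
toy_average_rank = average_rank(toy_scores)
```

A lower average rank is better, which is why 2.06 places ahead of 5.53. An average near 2 over 17 tasks means the entry sits at or near the top on most tasks rather than winning big on a few.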

The gap is widest on implementation tasks, the longest-horizon work on the benchmark. GPT-5.5 under Codex ranks 7.40 there. With Zenith, it ranks 1.60, ahead of every other entry including Fable. That's the result Intelligent Internet cares most about, because Anthropic's own launch notes for Fable said the longer and more complex the task, the larger Fable's lead grows.
