Cognition's FrontierCode Benchmark Humbles Every AI Coder With a 13.4% Score

EDITORIAL LEADERBOARD

Cognition

2H AGO

2 min read

2 hrs ago

2 min read

Every major coding benchmark today measures the same thing: does the code pass the tests? That question was useful when models were struggling to write functional patches at all. But as AI-generated code becomes the dominant path to production, passing tests is no longer the bar that matters. The real question is: would a senior maintainer actually merge this?

Cognition, the company behind the Devin coding agent, just released FrontierCode, a benchmark built around that harder question. The results are humbling for every model on the market.

The benchmark gap no one was measuring

The problem with existing benchmarks is well-documented. The Model Evaluation and Threat Research group (METR) published an analysis finding that many SWE-bench-passing pull requests would not survive human code review. A patch can make every failing test pass and still be wrong in ways that matter:

It might hardcode a value that happens to satisfy the test condition, fix the symptom without touching the cause, introduce a subtle regression in a code path the tests don't cover, or be structured in a way that makes the next change significantly harder.
Researchers at the Software Lab found that 7.2% to 8.4% of patches accepted by SWE-bench as correct were actually functionally incorrect when evaluated against the full developer test suite, translating to an absolute overestimation of 3.8 to 5.2 percentage points in reported resolution rates.

FrontierCode was built to close that gap. It measures how well models can truly meet the standards of high-quality production codebases, going beyond functional correctness to ask whether a model's output reflects the judgment of an experienced engineer.

Built by the people who actually merge PRs

The benchmark's core differentiator is who built it. More than 20 world-class open-source developers built realistic, diverse, and challenging coding tasks from the repos they maintain, spending more than 40 hours per task. These aren't synthetic problems -- they come from maintainers of 36 flagship repositories including Celery (29k stars), Budibase (28k stars), Uppy (30k stars), and Mattermost (37k stars).

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

Full access to in-depth AI research breakdowns
Be the first to know what's trending before it hits mainstream
Daily curated papers, repos, and industry moves

Takeaways

The benchmark gap no one was measuring

Built by the people who actually merge PRs

Don't miss what's next in AI