Datacurve has added Claude Opus 4.8 to its DeepSWE leaderboard, and the numbers tell a nuanced story. On the default high thinking effort setting, Opus 4.8 scores 6% higher than Opus 4.7 at its xhigh setting, while also lowering the average cost per task. That is a genuine efficiency win. But it also puts the new model in sharper relief against GPT-5.5, which still leads the leaderboard by a significant margin.

What DeepSWE actually measures

Before diving into the numbers, it helps to understand why DeepSWE results are worth paying attention to. Anthropic released Claude Opus 4.8 alongside a wave of benchmark claims, but most public coding benchmarks are starting to saturate at the frontier. DeepSWE was built specifically to separate models that cluster together on those older tests.

The benchmark has four properties that distinguish it from alternatives like SWE-Bench Pro:

  • Contamination-free tasks: Every task is written from scratch, not adapted from existing commits or pull requests. The solutions have never appeared in any model's training data.
  • Real-world complexity: Prompts are roughly half the length of SWE-Bench Pro's, yet reference solutions require 5.5x more code and around 2x more output tokens to produce.
  • Broad coverage: 113 tasks across 91 repositories in TypeScript, Go, Python, JavaScript, and Rust.
  • Reliable verification: Verifiers are hand-written to test observable software behavior, not implementation details. DeepSWE's false positive rate is 0.3%, compared to 8.5% on SWE-Bench Pro.

That last point matters more than it sounds. Anthropic's evaluations showed Opus 4.8 was around four times less likely than its predecessor to let flaws in its own code go unremarked. A benchmark whose verifiers regularly pass incorrect solutions would obscure exactly that kind of improvement.
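
To make the verification point concrete, here is a rough sketch of how verifier noise compresses the measured gap between two models. It rests on assumptions the article does not confirm: that "false positive rate" means the share of incorrect solutions a verifier wrongly accepts, and that verifiers never reject a correct solution. The two true pass rates are hypothetical values chosen for illustration, not leaderboard results.

```python
def observed_pass_rate(true_pass_rate: float, false_positive_rate: float) -> float:
    """Pass rate a verifier would report.

    Assumption: correct solutions always pass, and each incorrect solution
    is wrongly accepted with probability `false_positive_rate`.
    """
    return true_pass_rate + false_positive_rate * (1.0 - true_pass_rate)


def observed_gap(rate_a: float, rate_b: float, false_positive_rate: float) -> float:
    """Difference in reported pass rates between two models."""
    return (observed_pass_rate(rate_a, false_positive_rate)
            - observed_pass_rate(rate_b, false_positive_rate))


# Hypothetical true pass rates, for illustration only.
MODEL_A = 0.50
MODEL_B = 0.40

# False positive rates as quoted in the article.
BENCHMARKS = [("DeepSWE", 0.003), ("SWE-Bench Pro", 0.085)]

for name, fpr in BENCHMARKS:
    gap = observed_gap(MODEL_A, MODEL_B, fpr)
    print(f"{name}: observed gap {gap:.4f} vs true gap {MODEL_A - MODEL_B:.4f}")
```

Under this simple model the measured gap equals the true gap multiplied by (1 - false positive rate), so a noisier verifier shrinks every real difference between models by the same factor. It also inflates every score, which is a separate distortion.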
