DeepSWE, Datacurve's contamination-free coding benchmark, just updated its leaderboard, and the top spot has changed hands. Claude Fable 5 now sits at #1 with a 70% pass@1 score, edging out GPT-5.5 by 3 percentage points and setting a new state of the art on what is arguably the most rigorous public coding agent evaluation available today. Kimi K2.7 also makes its debut on the board at 31%.

Why this benchmark is different

To understand why this result matters, you need to understand what DeepSWE is actually testing. Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and the confidence intervals of adjacent configurations often overlap, so small gaps in score do not reliably rank them. DeepSWE is a long-horizon software engineering benchmark built to separate those models.
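To make the saturation point concrete, here is a minimal sketch of how overlapping confidence intervals can be checked. It assumes a Wilson score interval on a pass rate and a task count of 500; both the choice of interval and every number in it are hypothetical illustrations, not DeepSWE's published methodology or results.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate of passes/n."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical scores on a hypothetical 500-task benchmark.
model_a = wilson_interval(400, 500)  # 80% pass rate
model_b = wilson_interval(390, 500)  # 78% pass rate
model_c = wilson_interval(300, 500)  # 60% pass rate

print("A vs B overlap:", intervals_overlap(model_a, model_b))
print("A vs C overlap:", intervals_overlap(model_a, model_c))
```

With these made-up numbers, a 2-point gap falls inside the noise while a 20-point gap does not, which is the sense in which a saturated benchmark stops separating models.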

The benchmark is built on four core design principles that distinguish it from SWE-bench Pro, the previous gold standard:

  • Contamination-free tasks: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
  • High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
  • Real-world complexity: Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and approximately 2x more output tokens.
  • Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details, accepting any solution with correct observable behavior regardless of internal symbol names or structure.
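The last principle, verifying behavior rather than implementation, can be sketched as follows. This is an illustrative toy, not one of DeepSWE's actual verifiers: the slugify task, both submissions, and the test cases are hypothetical. The point is that the verifier only calls the public entry point and compares outputs, so two structurally different solutions both pass.

```python
import re
from typing import Callable


def slugify_regex(title: str) -> str:
    """Hypothetical submission A: a single regular-expression pass."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


class _Slugger:
    """Hypothetical submission B: a hand-rolled scanner with its own internals."""

    def run(self, title: str) -> str:
        words: list[str] = []
        current: list[str] = []
        for ch in title.lower():
            if ch.isascii() and ch.isalnum():
                current.append(ch)
            elif current:
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))
        return "-".join(words)


def slugify_scanner(title: str) -> str:
    return _Slugger().run(title)


def slugify_broken(title: str) -> str:
    """Hypothetical incorrect submission: leaves punctuation in place."""
    return title.lower().replace(" ", "-")


def verify(slugify: Callable[[str], str]) -> bool:
    """Behavioral verifier: checks outputs only, never names or structure."""
    cases = {
        "Hello, World!": "hello-world",
        "  DeepSWE  2x ": "deepswe-2x",
        "": "",
    }
    return all(slugify(given) == expected for given, expected in cases.items())


for candidate in (slugify_regex, slugify_scanner, slugify_broken):
    print(candidate.__name__, "passes" if verify(candidate) else "fails")
```

A verifier that instead asserted the presence of a particular helper class or function name would reject one of the two correct submissions, which is the failure mode this design principle is meant to avoid.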

The complexity gap is significant. Where SWE-bench Verified tasks average 10 lines of code in the reference solution and SWE-bench Pro averages 120, DeepSWE's average reference solution is 668 lines across 7 files. These are not one-liner patches; they are multi-file feature implementations.
