Datacurve just updated the DeepSWE leaderboard, and the new top open-source entry is GLM 5.2 from Zhipu AI. Running at max effort, it scores 44% pass@1 on the benchmark, putting it 13 percentage points ahead of the previous open-source leader, Kimi K2.7 Code, which sits at 31%. That gap is not a rounding artifact: it is wider than the entire spread between most frontier models on older benchmarks like SWE-Bench Verified.

What makes DeepSWE different

To understand why this result matters, you need to know what DeepSWE is actually measuring. Most public coding benchmarks are starting to saturate: top models cluster in a narrow score band where differences fall inside confidence intervals. DeepSWE was built specifically to break that logjam.
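To make "differences fall inside confidence intervals" concrete, here is a minimal Python sketch. It uses a simple normal-approximation (Wald) interval, and the benchmark size and both scores are hypothetical numbers chosen for illustration, not figures from any leaderboard.

```python
import math


def wald_interval(pass_rate, n_tasks, z=1.96):
    """Approximate 95% confidence interval for a pass rate.

    Uses the normal (Wald) approximation, which assumes each task is an
    independent pass/fail trial scored once.
    """
    half_width = z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)
    return (max(0.0, pass_rate - half_width), min(1.0, pass_rate + half_width))


def intervals_overlap(a, b):
    """Return True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical saturated benchmark: 500 tasks, two models two points apart.
N_TASKS = 500
model_a = wald_interval(0.78, N_TASKS)
model_b = wald_interval(0.80, N_TASKS)

print("model A:", model_a)
print("model B:", model_b)
print("overlap:", intervals_overlap(model_a, model_b))
```

When the two intervals overlap this heavily, a two-point lead says little about which model is actually stronger, which is the saturation problem described above.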

The benchmark covers 113 tasks across 91 open-source repositories in TypeScript, Go, Python, JavaScript, and Rust. Four properties set it apart from SWE-Bench and its variants:

  • Contamination-free tasks: Every task is written from scratch, not adapted from existing commits or pull requests. The task container ships only a shallow clone with no gold commit in the workspace, so there is nothing for an agent to look up.
  • Genuinely hard: Prompts average 2,158 characters, roughly half the length of SWE-Bench Pro's, yet reference solutions average 668 lines of code added across 7 files. That is 5.5x more code than SWE-Bench Pro requires per task.
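
The ratios quoted above imply rough per-task figures for SWE-Bench Pro. The sketch below works through that arithmetic in Python. It assumes "5.5x more code" means 5.5 times as much and "roughly half" means a factor of two. Both are readings of the text, so the outputs are estimates, not published SWE-Bench Pro numbers.

```python
# Figures stated for DeepSWE in the text above.
DEEPSWE_PROMPT_CHARS = 2158
DEEPSWE_SOLUTION_LINES = 668
DEEPSWE_FILES_TOUCHED = 7

# Assumptions: how the quoted ratios are interpreted.
CODE_RATIO = 5.5    # DeepSWE solution size / SWE-Bench Pro solution size
PROMPT_RATIO = 0.5  # DeepSWE prompt length / SWE-Bench Pro prompt length


def implied_baseline(deepswe_value, ratio):
    """Back out the comparison benchmark's value from a DeepSWE value and a ratio."""
    return deepswe_value / ratio


implied_pro_lines = implied_baseline(DEEPSWE_SOLUTION_LINES, CODE_RATIO)
implied_pro_prompt = implied_baseline(DEEPSWE_PROMPT_CHARS, PROMPT_RATIO)
lines_per_file = DEEPSWE_SOLUTION_LINES / DEEPSWE_FILES_TOUCHED

print(f"Implied SWE-Bench Pro solution size: ~{implied_pro_lines:.0f} lines")
print(f"Implied SWE-Bench Pro prompt length: ~{implied_pro_prompt:.0f} characters")
print(f"DeepSWE lines added per file touched: ~{lines_per_file:.0f}")
```

Under those assumptions, a DeepSWE agent gets about half the instructions and has to produce several times the code, spread across multiple files.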