
Datacurve just updated the DeepSWE leaderboard, and the new top open-source entry is GLM 5.2 from Zhipu AI. Running at max effort, it scores 44% pass@1 on the benchmark, putting it 13 percentage points ahead of the previous open-source leader, Kimi K2.7 Code, which sits at 31%. That gap is not a rounding artifact: it is wider than the entire spread between most frontier models on older benchmarks like SWE-Bench Verified.
What makes DeepSWE different
To understand why this result matters, you need to know what DeepSWE is actually measuring. Most public coding benchmarks are starting to saturate: top models cluster in a narrow score band where differences fall inside confidence intervals. DeepSWE was built specifically to break that logjam.
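To make the saturation point concrete, here is a minimal sketch of how a confidence interval can swallow a small score difference. It uses the Wilson score interval for a binomial proportion; the 500-task benchmark size and the 70% and 72% scores are hypothetical numbers chosen for illustration, not figures from any leaderboard, and the function names are our own.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical: two models on a hypothetical 500-task benchmark.
model_a = wilson_interval(350, 500)  # 70% pass rate (assumed)
model_b = wilson_interval(360, 500)  # 72% pass rate (assumed)

print("model A interval:", model_a)
print("model B interval:", model_b)
print("overlap:", intervals_overlap(model_a, model_b))
```

Overlapping intervals are only a rough heuristic, not a formal significance test, but they show why a two-point lead on a few hundred tasks says little about which model is actually better.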
The benchmark covers 113 tasks across 91 open-source repositories in TypeScript, Go, Python, JavaScript, and Rust. Four properties set it apart from SWE-Bench and its variants:
- Contamination-free tasks: Every task is written from scratch, not adapted from existing commits or pull requests. The task container ships only a shallow clone with no gold commit in the workspace, so there is nothing for an agent to look up.
- Genuinely hard: Prompts average 2,158 characters, roughly half the length of SWE-Bench Pro's, yet reference solutions average 668 lines of code added across 7 files. That is 5.5x more code than SWE-Bench Pro requires per task.
