Zhipu AI's GLM 5.2 Tops Open-Source Coding at 44%, Beating Rivals for Less

Datacurve

2H AGO

2 min read

Datacurve just updated the DeepSWE leaderboard, and the new top open-source entry is GLM 5.2 from Zhipu AI. Running at max effort, it scores 44% pass@1 on the benchmark, putting it 13 percentage points ahead of the previous open-source leader, Kimi K2.7 Code, which sits at 31%. That gap is not a rounding artifact: it is wider than the entire spread between most frontier models on older benchmarks like SWE-Bench Verified.

What makes DeepSWE different

To understand why this result matters, you need to know what DeepSWE is actually measuring. Most public coding benchmarks are starting to saturate: top models cluster in a narrow score band where differences fall inside confidence intervals. DeepSWE was built specifically to break that logjam.
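To make the saturation point concrete, here is a minimal sketch of how a confidence interval on a benchmark pass rate can be computed, and why two nearby scores may be statistically hard to tell apart. It uses the standard Wilson score interval; the function names, the 500-task benchmark size, and the pass counts are hypothetical values chosen for illustration, not figures from any leaderboard.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical example: two models on an assumed 500-task benchmark.
model_a = wilson_interval(350, 500)  # 70% pass rate
model_b = wilson_interval(360, 500)  # 72% pass rate

print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("intervals overlap:", intervals_overlap(model_a, model_b))
```

In this illustrative setup, a two-point difference sits well inside each model's interval, which is the kind of clustering the paragraph above describes. Overlapping intervals are only a rough heuristic; a paired comparison on the same tasks would be a more careful test.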

The benchmark covers 113 tasks across 91 open-source repositories in TypeScript, Go, Python, JavaScript, and Rust. Four properties set it apart from SWE-Bench and its variants:

Contamination-free tasks: Every task is written from scratch, not adapted from existing commits or pull requests. The task container ships only a shallow clone with no gold commit in the workspace, so there is nothing for an agent to look up.
Genuinely hard: Prompts average 2,158 characters, roughly half the length of SWE-Bench Pro's, yet reference solutions average 668 lines of code added across 7 files. That is 5.5x more code than SWE-Bench Pro requires per task.
