Kilo Puts Z.ai's GLM-5.2 Against Real Bugs and Finds a Surprising Weakness

Kilo

5H AGO

2 min read

GLM-5.2 from Z.ai has been making noise as the strongest open-weight coding model available right now. It scores 62.1 on SWE-bench Pro and is priced at $1.40/$4.40 per million tokens (input/output) through the Z.ai API, roughly one-sixth the blended cost of GPT-5.5. But benchmark numbers don't tell you how the model behaves when you point it at a real pull request. The team at Kilo decided to find out.

The Setup: Planted Bugs, Graded Reviews

The Kilo team built a TypeScript backend (a task management API using Bun, Hono, Drizzle, and SQLite), wrote a test suite to lock in correct behavior, then deliberately planted bugs and graded the model's reviews against the planted set. A bug only counted as caught when the model flagged the actual problem, not something adjacent to it. They ran every reasoning effort level the model offers (low, medium, high) against three different prompt framings:

Casual: "I think the implementation is pretty clean, can you take a look?"
Consistency-focused: "Review for real bugs, security issues, data consistency problems, and production edge cases."
Strict production: "Review this as if you are blocking or approving a production PR."

The code never changed. Only the prompt wording and reasoning effort varied.
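
The experiment design above amounts to a 3 x 3 grid of runs over one fixed codebase, with strict scoring. Here is a minimal sketch of that grid and scoring rule. It is not Kilo's actual harness: the type names, `buildRuns`, `scoreReview`, and the bug IDs are all hypothetical, and it assumes a grader has already mapped each review finding to a bug ID.

```typescript
// Hypothetical names throughout; this is an illustration, not Kilo's harness.
type Effort = "low" | "medium" | "high";
type Framing = "casual" | "consistency" | "strict";

const EFFORTS: Effort[] = ["low", "medium", "high"];
const FRAMINGS: Framing[] = ["casual", "consistency", "strict"];

// Prompt wording quoted from the article.
const PROMPTS: Record<Framing, string> = {
  casual: "I think the implementation is pretty clean, can you take a look?",
  consistency:
    "Review for real bugs, security issues, data consistency problems, and production edge cases.",
  strict: "Review this as if you are blocking or approving a production PR.",
};

interface Run {
  effort: Effort;
  framing: Framing;
  prompt: string;
}

// Cross every effort level with every framing; the code under review is fixed.
function buildRuns(): Run[] {
  const runs: Run[] = [];
  for (const effort of EFFORTS) {
    for (const framing of FRAMINGS) {
      runs.push({ effort, framing, prompt: PROMPTS[framing] });
    }
  }
  return runs;
}

// A planted bug counts as caught only when that exact bug was flagged.
// Findings that do not match a planted bug ID earn no credit.
function scoreReview(planted: string[], flagged: string[]): number {
  const flaggedSet = new Set(flagged);
  return planted.filter((id) => flaggedSet.has(id)).length;
}
```

Holding the code constant and varying only `effort` and `framing` means any difference in score between runs can be attributed to those two settings.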

Round 1: Steady and Reliable

The first codebase had 16 planted bugs covering the classics: SQL injection in a search query, a user endpoint returning password hashes, a missing auth check on an admin export, an authorization hole letting any user edit another user's tasks, CSV formula injection, a pagination off-by-one, and several bulk-operation correctness bugs. GLM-5.2 caught every serious security bug in every run, landing between 13 and 15 of 16 regardless of prompt wording or reasoning effort. On a straightforward codebase, the prompt barely mattered.
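
To make two of these bug classes concrete, here is a hedged sketch of what a pagination off-by-one and a CSV formula injection fix might look like. The article does not show Kilo's planted code, so the function names and the exact form of each bug are assumptions for illustration only.

```typescript
// Illustrative only: not the code Kilo planted. Pages are assumed to be 1-based.

// Buggy variant: page 1 starts at offset pageSize, so the first page is skipped.
function buggyOffset(page: number, pageSize: number): number {
  return page * pageSize;
}

// Fixed variant: page 1 starts at offset 0.
function fixedOffset(page: number, pageSize: number): number {
  return (page - 1) * pageSize;
}

// CSV formula injection: a cell starting with =, +, -, or @ may be evaluated
// as a formula by spreadsheet software. One common mitigation is to prefix
// such cells with a single quote, then apply normal CSV quoting.
function escapeCsvCell(value: string): string {
  const neutralised = /^[=+\-@]/.test(value) ? "'" + value : value;
  return '"' + neutralised.replace(/"/g, '""') + '"';
}
```

A review that only remarks "pagination looks fragile" would be adjacent to the off-by-one; under the grading rule described earlier, the bug counts as caught only when the review identifies the wrong offset itself.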
