
Kilo Code published a benchmark pitting two newly released open-weight coding models against each other: Z.ai's GLM-5.2 and Moonshot AI's Kimi K2.7 Code. The two are often compared as similarly priced rivals in the open-weight space, and with the latest version of each released in the same week, Kilo wanted to see how they compare. The result is a structured two-phase test that separates planning from building, and the gap between the two models showed up almost entirely in the first phase.
One task, two phases, one clear winner
Kilo ran both models through the same two-phase test. First, each model planned a backend service. Kilo scored the plans, picked the stronger one, and then had both models build that exact plan from scratch in Kilo Code CLI. The task was a feature flag service: a backend that decides whether a feature is on for a given user and supports gradual percentage rollouts.
The task is deceptively hard. The rollout has to be deterministic: if a user is included in the first 20%, they should still be included when the rollout grows to 40%. The service also cannot solve that by storing every user assignment in a database. A weak plan waves this away. A strong plan nails the exact math.
Both models nailed the hard part. Each landed on the same kind of rollout math, the kind that grows a rollout without dropping anyone already in it. But on the judgment calls the prompt left open, they diverged.
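The benchmark write-up does not reproduce either model's rollout math. As an illustration only, here is a minimal sketch of one common approach that has the required property: hash the flag and user into a fixed bucket, then compare the bucket against a threshold derived from the rollout percentage. The function names, the bucket count, and the choice of SHA-256 are assumptions made for this example, not details from either plan.

```python
import hashlib

# Assumption for this sketch: 10,000 buckets gives 0.01% rollout resolution.
BUCKETS = 10_000


def bucket_for(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, BUCKETS)."""
    # Including the flag key means a user lands in different buckets
    # for different flags, so the same users are not always first.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(flag_key: str, user_id: str, rollout_percent: float) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    threshold = round(rollout_percent * BUCKETS / 100)
    # The bucket never changes, and the threshold only grows as the rollout
    # grows, so a user included at 20% is still included at 40%.
    return bucket_for(flag_key, user_id) < threshold
```

Because the decision is recomputed from the hash on every request, nothing has to be stored per user. Raising `rollout_percent` only moves the threshold upward, which is what keeps earlier users included.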
Where GLM-5.2 pulled ahead
GLM-5.2 scored 9.0 against Kimi K2.7 Code's 8.1 on the planning rubric. The difference wasn't volume of output. Kimi's plan was actually longer and included more ready-to-paste code. The gap was in which model made the hard calls and showed its reasoning.

