

When teams benchmark coding agents, they almost always report one number: pass rate. But a new paper accepted at ICML 2026 from Sentient AI argues that this single metric is dangerously incomplete. The scaffold (the software harness that wraps the model and orchestrates its tool use, context, and control flow) can swing token costs by up to 40x across agents with nearly identical pass rates. In other words, you might be paying 40 times more for the same result, and your benchmark would never tell you.
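To make the arithmetic concrete, here is a minimal sketch of a cost-aware metric, tokens spent per solved task, reported alongside pass rate. The `AgentRun` class, the scaffold names, and every number below are hypothetical and chosen only to illustrate the calculation; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Aggregate benchmark result for one model-plus-scaffold pairing (hypothetical)."""
    name: str
    tasks: int
    passed: int
    total_tokens: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.tasks

    @property
    def tokens_per_pass(self) -> float:
        # Tokens spent per solved task; infinite if nothing was solved.
        return self.total_tokens / self.passed if self.passed else float("inf")


# Made-up numbers: identical pass rates, very different token budgets.
lean = AgentRun("scaffold_a", tasks=50, passed=30, total_tokens=3_000_000)
heavy = AgentRun("scaffold_b", tasks=50, passed=30, total_tokens=120_000_000)

cost_ratio = heavy.tokens_per_pass / lean.tokens_per_pass
print(f"pass rates: {lean.pass_rate:.0%} vs {heavy.pass_rate:.0%}")
print(f"token cost ratio: {cost_ratio:.0f}x")
```

A leaderboard that shows only `pass_rate` would rank these two runs as equal; adding `tokens_per_pass` is one simple way to surface the difference.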
The Problem With Leaderboards
Today's coding agent leaderboards rank models by pass rate on benchmarks like SWE-bench or Terminal-Bench. But a model never runs alone. It runs inside a scaffold (also called a harness): the prompts, tools, and control logic that allow a language model to solve agentic tasks. This layer decides how many turns the agent gets, what tools it can call, how context is managed, and how errors are handled. The scaffold is invisible on most leaderboards, yet it shapes everything.
The Sentient paper, titled "The Scaffold Effect in Coding Agents", makes this invisible variable visible. Researchers Naman Vats and Oleg Golev tested two frontier models, Qwen 3.6 Plus and MiniMax M2.5, across three open-source harnesses (Goose, OpenCode, and the OpenHands SDK) on 50 Terminal Bench Pro tasks.
What They Found: Failure Has a Fingerprint
The most striking finding is that harnesses have unique failure fingerprints, regardless of which model runs inside them. The same scaffold produces the same kinds of errors whether you plug in Qwen or MiniMax. This means failure modes are a property of scaffold design, not model capability, which fundamentally reframes how we should think about agent debugging.
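One way to check for a fingerprint like this in your own runs is to compare failure-category distributions. The sketch below is not the paper's method: it assumes you have already labeled each failed task with a category, and the labels, harness and model names, and counts are all hypothetical. It uses total variation distance as a simple stand-in for whatever comparison the authors used.

```python
from collections import Counter


def fingerprint(failures: list[str]) -> dict[str, float]:
    """Normalize failure labels into a share-per-category distribution."""
    counts = Counter(failures)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Hypothetical failure logs keyed by (harness, model).
logs = {
    ("harness_a", "model_x"): ["timeout"] * 6 + ["bad_edit"] * 2,
    ("harness_a", "model_y"): ["timeout"] * 5 + ["bad_edit"] * 3,
    ("harness_b", "model_x"): ["context_overflow"] * 7 + ["timeout"] * 1,
}
fingerprints = {key: fingerprint(labels) for key, labels in logs.items()}

same_harness = distance(
    fingerprints[("harness_a", "model_x")], fingerprints[("harness_a", "model_y")]
)
same_model = distance(
    fingerprints[("harness_a", "model_x")], fingerprints[("harness_b", "model_x")]
)
print(f"same harness, different model: {same_harness:.3f}")
print(f"same model, different harness: {same_model:.3f}")
```

If failures really are a property of the scaffold, the same-harness distance should stay small while the same-model distance grows, as it does in this made-up data.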
The paper identified three key dimensions where harnesses diverge:
