Anthropic's Claude Opus 4.8 Hits 69.2% on SWE-Bench and Lands in Devin

May 28, 2026

2 min read

Anthropic's Claude Opus 4.8 is now available inside Windsurf (now rebranding to Devin Desktop) and the Devin CLI. It's the most capable model yet to land in the IDE, and it arrives with a meaningful set of improvements that matter specifically for the kind of long-running, agentic coding sessions these tools are built around.

What changed under the hood

Opus 4.8 is an upgrade to Anthropic's Opus class of models, with stronger performance across coding, agentic tasks, and professional work, and the consistency to handle long-running work. The headline number: Opus 4.8 leads in 6 out of 7 benchmarks, achieving the highest score on SWE-Bench Pro at 69.2%.

But the more interesting improvement is behavioral. Early testers found Opus 4.8 to be more reliable and sharper in its judgment when performing agentic tasks. Specifically, the model is significantly less likely to silently paper over its own mistakes. According to Anthropic's evaluations, Opus 4.8 is around four times less likely than its predecessor to let flaws in code it has written pass unremarked, a critical property when you're trusting an agent to run unattended.

Opus 4.8 uses tools cleanly and follows instructions with the consistency that autonomous engineering workloads need to keep running unattended. It improves on Opus 4.6 and fixes the comment-verbosity and tool-calling issues seen with Opus 4.7. Cognition's CEO Scott Wu noted that this release from Anthropic translates directly into faster capability gains for engineers building on Devin.
