
Claude Opus 4.8 is now generally available inside GitHub Copilot. Anthropic's latest flagship model brings meaningful gains in agentic coding, large-codebase navigation, and self-reported code reliability, and it lands at the same API price as Opus 4.7. The catch: it carries a 15x premium request multiplier in Copilot, making it the most expensive model in the picker by a wide margin.
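To make the 15x multiplier concrete, here is a minimal arithmetic sketch. The multiplier comes from the announcement; the monthly allowance of 300 premium requests is a hypothetical figure chosen purely for illustration, and the function names are made up for this example, so substitute whatever your own plan provides.

```python
# Rough budget arithmetic for a premium request multiplier.
# OPUS_48_MULTIPLIER is the figure quoted above; the allowance is an
# illustrative assumption, not a number taken from any Copilot plan.

OPUS_48_MULTIPLIER = 15
HYPOTHETICAL_MONTHLY_ALLOWANCE = 300  # assumption for illustration only


def premium_requests_used(prompts: int, multiplier: int) -> int:
    """Premium requests consumed by sending `prompts` prompts."""
    return prompts * multiplier


def prompts_within_allowance(allowance: int, multiplier: int) -> int:
    """Whole prompts that fit inside `allowance` premium requests."""
    return allowance // multiplier


if __name__ == "__main__":
    used = premium_requests_used(10, OPUS_48_MULTIPLIER)
    fits = prompts_within_allowance(
        HYPOTHETICAL_MONTHLY_ALLOWANCE, OPUS_48_MULTIPLIER
    )
    print(f"10 prompts consume {used} premium requests")
    print(f"A {HYPOTHETICAL_MONTHLY_ALLOWANCE}-request allowance covers {fits} prompts")
```

Under that assumed allowance, a month's budget is gone after 20 prompts, which is why the multiplier matters more than the unchanged API price for Copilot users.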
What actually changed under the hood
Opus 4.8 builds on Opus 4.7 with improvements across benchmarks, and Anthropic describes it as a more effective collaborator. The headline numbers back that up. On SWE-bench Verified, the model scores 88.6%, up from 87.6% on Opus 4.7 and 80.8% on Opus 4.6. On the harder SWE-bench Pro benchmark, it hits 69.2%, up from 64.3%. SWE-bench Pro tests whether a model can autonomously resolve real GitHub issues, which makes it a strong proxy for practical coding ability.
Opus 4.8 is also the strongest computer-use and browser-agent model Anthropic has tested, scoring 84% on Online-Mind2Web, a meaningful jump over both Opus 4.7 and GPT-5.5. That said, Terminal-Bench 2.1 for agentic terminal coding still belongs to GPT-5.5 at 78.2%, with Opus 4.8 coming in at 74.6%.
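The gaps quoted above are easier to compare as percentage-point differences. The sketch below only restates scores given in this article; the dictionary layout and the `point_gap` helper are illustrative choices, not part of any benchmark tooling.

```python
# Percentage-point gaps between the scores quoted in the article.
# Only figures stated in the text are included; nothing is estimated.

SCORES = {
    "SWE-bench Verified": {"Opus 4.8": 88.6, "Opus 4.7": 87.6, "Opus 4.6": 80.8},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "GPT-5.5": 78.2},
}


def point_gap(benchmark: str, model: str, baseline: str) -> float:
    """Return model score minus baseline score, in percentage points."""
    scores = SCORES[benchmark]
    # Round to one decimal to hide binary floating-point noise.
    return round(scores[model] - scores[baseline], 1)


if __name__ == "__main__":
    print(point_gap("SWE-bench Verified", "Opus 4.8", "Opus 4.7"))
    print(point_gap("SWE-bench Verified", "Opus 4.8", "Opus 4.6"))
    print(point_gap("Terminal-Bench 2.1", "Opus 4.8", "GPT-5.5"))
```

Read this way, the step from Opus 4.7 to 4.8 on SWE-bench Verified is a single point, while the deficit to GPT-5.5 on Terminal-Bench 2.1 is several times larger.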
The reliability story is the real headline
Raw benchmark scores are only part of the picture. The more interesting shift is in what Anthropic calls "honesty": the model's tendency to flag its own mistakes rather than silently ship broken code into your pipeline.
- Anthropic's evaluations found the model to be around four times less likely than its predecessor to leave flaws in its own code unremarked.
- Early testers found Opus 4.8 sharper in judgment when performing agentic tasks, more likely to flag uncertainty, and less likely to make unsupported claims.
