

Anthropic had a big day. The company released Opus 4.8, the newest version of its most advanced publicly available model, and simultaneously closed a $65 billion Series H round at a $965 billion post-money valuation. The model launch and the funding round together paint a picture of a company that has gone from safety-focused underdog to the most valuable AI startup in the world, all in the span of a few months.
A benchmark built for the real world
To understand why the benchmark numbers here matter, you need to understand what GDPval-AA actually tests. Most AI benchmarks are multiple-choice or short-answer tests. GDPval-AA is different. It requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, that mirror actual work products across finance, healthcare, legal, and other professional domains. Models are dropped into a sandboxed environment with shell access and web browsing, then scored by an LLM judge that makes blind pairwise comparisons of outputs from different models on the same task.
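The article doesn't spell out how the judge's pairwise verdicts become a single score per model. Elo-style ratings are one common approach. Below is a minimal sketch of a sequential Elo update. It assumes each judged comparison is recorded as a `(model_a, model_b, score_a)` tuple. The function names, the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, not GDPval-AA's documented method.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_ratings(ratings: dict, comparisons, k: float = 32.0) -> dict:
    """Apply one Elo update per judged comparison and return a new ratings dict.

    Illustrative assumption: `comparisons` yields (model_a, model_b, score_a) tuples,
    where score_a is 1.0 if the judge preferred A's output, 0.0 if it preferred B's,
    and 0.5 for a tie.
    """
    ratings = dict(ratings)  # copy so the caller's dict is not mutated
    for a, b, score_a in comparisons:
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (score_a - e_a)
        ratings[a] += delta  # winner gains what the loser gives up,
        ratings[b] -= delta  # so the total rating pool stays constant
    return ratings


# Hypothetical example: two models start at 1000 and the judge prefers model_x once.
print(update_ratings({"model_x": 1000.0, "model_y": 1000.0}, [("model_x", "model_y", 1.0)]))
```

A sequential update like this depends on the order in which comparisons are processed. Leaderboards often fit a Bradley-Terry model over all comparisons at once for that reason, so treat this as an illustration of the idea rather than a reproduction of the benchmark's scoring.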
The GDPval gold public dataset includes 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity. Think: a retail supervisor creating a daily task list PDF, an A/V tech drafting a stage plot, a regional director building an Excel planogram tool. These are the kinds of deliverables that knowledge workers produce every day, not the kind of puzzles that show up in academic benchmarks.
The numbers that matter
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) tops GDPval-AA with a score of 1890, followed by GPT-5.5 (xhigh) at 1769 and GPT-5.5 (high) at 1753. Under the standard Elo model, that 121-point gap over GPT-5.5 (xhigh) translates to a roughly 67% head-to-head win rate against OpenAI's best configuration, the second-ranked model overall. The jump from Opus 4.7 is also notable: +137 Elo points in a single generation.
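The 67% figure comes from the standard Elo expected-score formula, E = 1 / (1 + 10^(-Δ/400)), where Δ is the rating gap. The sketch below assumes GDPval-AA scores sit on that usual 400-point logistic scale; the article doesn't state the scale explicitly. The function name `win_probability` is our own.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head win rate of A over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


opus_48 = 1890.0      # Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
gpt55_xhigh = 1769.0  # GPT-5.5 (xhigh)

p = win_probability(opus_48, gpt55_xhigh)
print(f"Gap: {opus_48 - gpt55_xhigh:.0f} points, expected win rate: {p:.1%}")
# prints: Gap: 121 points, expected win rate: 66.7%
```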
