Anthropic's Claude Opus 4.8 just became the new state-of-the-art on ARC-AGI-3, the hardest active benchmark in AI research, scoring 1.5% on the semi-private evaluation set. That number sounds tiny, but it is the highest score any frontier language model has achieved on a test where humans score 100%, and it is roughly 3.5 times the previous best LLM result of 0.43%.

What ARC-AGI-3 actually is

ARC-AGI-3 is an interactive reasoning benchmark that challenges AI agents to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously. That's a meaningful departure from the static benchmarks that came before it.

Instead of presenting static puzzles with clear input-output pairs, it drops AI agents into interactive environments with no instructions, no stated goals, and no explicit rules. The agent has to figure out everything on its own through trial and observation, the same way a person would when handed a game they have never seen before.
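
To make "trial and observation" concrete, here is a minimal Python sketch of that kind of loop. It is not the ARC-AGI-3 interface or any lab's agent: `ToyEnv`, `explore`, and `path_to_frontier` are hypothetical names invented for this illustration, and the tiny grid world is an assumption chosen only because it is easy to follow. The agent is given opaque action numbers and nothing else. It records what each action did in each state (a crude world model), keeps trying things it has not tried yet, and only learns what the goal was when the environment signals that it has been reached.

```python
from collections import deque


class ToyEnv:
    """Hypothetical stand-in environment (not the real benchmark).

    The agent sees only opaque action ids and observations. What each
    action does and where the goal sits are hidden inside this class.
    """

    ACTIONS = (0, 1, 2, 3)
    SIZE = 4
    _MOVES = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}  # hidden rules
    _GOAL = (3, 2)                                            # hidden goal

    def __init__(self):
        self.pos = (0, 0)

    def step(self, action):
        """Apply an action; return (observation, done)."""
        dx, dy = self._MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.SIZE - 1)
        y = min(max(self.pos[1] + dy, 0), self.SIZE - 1)
        self.pos = (x, y)
        return self.pos, self.pos == self._GOAL


def path_to_frontier(model, start, actions):
    """Breadth-first search through the learned model to the nearest
    state that still has an action the agent has never tried."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if any((state, a) not in model for a in actions):
            return path
        for a in actions:
            nxt = model[(state, a)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [a]))
    return None  # everything reachable has been fully explored


def explore(env, max_steps=500):
    """Explore with no instructions; return (goal_state, steps, model)."""
    model = {}  # learned world model: (state, action) -> next state
    state, steps = env.pos, 0
    while steps < max_steps:
        untried = [a for a in env.ACTIONS if (state, a) not in model]
        if untried:
            plan = [untried[0]]  # try something new right here
        else:
            plan = path_to_frontier(model, state, env.ACTIONS)
            if plan is None:
                break  # nothing left to try anywhere
        for action in plan:
            nxt, done = env.step(action)
            model[(state, action)] = nxt  # remember what the action did
            state, steps = nxt, steps + 1
            if done:
                return state, steps, model
    return None, steps, model


if __name__ == "__main__":
    goal, steps, model = explore(ToyEnv())
    print(f"goal found at {goal} after {steps} steps, "
          f"{len(model)} transitions learned")
```

The real benchmark environments are far richer than a 4x4 grid, so exhaustive exploration like this is only meant to show the shape of the problem: the rules and the goal have to be inferred from interaction, not read from a prompt.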
