
Most AI benchmarks ask a model to answer a question, solve a math problem, or write a function. AA-Briefcase asks it to act as a consultant for six weeks, synthesize 25,000 Slack messages and 3,500 emails, build a financial model, and present findings to a board. The gap between those two things is exactly the gap the benchmark is designed to expose.
What the benchmark actually tests
Artificial Analysis built AA-Briefcase around four private, multi-week knowledge work scenarios covering data science, product management, banking operations, and heavy industry strategy. Across those four scenarios there are 91 tasks in total, each requiring a realistic professional deliverable: spreadsheets, PowerPoint decks, PDFs, memos. The tasks build week by week within a scenario, sharing institutional context, so a model working on week 4 needs to understand what happened in weeks 1 through 3.
The input data is deliberately messy. Each task draws on hundreds of files, and the full benchmark contains nearly 2,000 source files, including:
- 25,000+ Slack messages
- 3,500+ emails
- Company documents, meeting transcripts, and large-scale data exports
- Files that contradict each other, simulating real organizational noise
Grading has three components. Each submission is first scored on a binary rubric (did the model find the right numbers, cite the right sources, resolve planted contradictions?), then compared pairwise against other models' submissions on two further dimensions: analytical quality and presentation quality. The combined score is expressed as an Elo rating, the same system used in chess rankings, where a higher rating means a model is expected to win more of its head-to-head comparisons against lower-rated models.
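To make the Elo idea concrete, here is a minimal sketch of the classic chess-style formulation. It is illustrative only: the article does not describe how Artificial Analysis fits its ratings or folds the rubric pass rate into the combined score, so the function names, the 400-point scale, and the K-factor of 32 are assumptions borrowed from standard chess conventions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model.

    The 400-point scale is the chess convention (an assumption here):
    a 400-point lead corresponds to roughly 10:1 odds of winning.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> tuple[float, float]:
    """Return new (A, B) ratings after one pairwise comparison.

    score_a is 1.0 if A's deliverable was preferred, 0.0 if B's was,
    and 0.5 for a tie. k (hypothetical value) controls the step size.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating pool is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models: each is expected to win half the time.
    print(expected_score(1500, 1500))
    # The winner of an even matchup gains k/2 points, the loser drops k/2.
    print(update_ratings(1500, 1500, score_a=1.0))
```

Under this model, a win over a much higher-rated opponent moves the ratings more than a win over a weaker one, which is why a large lead on the leaderboard implies consistent head-to-head wins, not just a few lucky ones.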
The leaderboard, and what it costs
Claude Fable 5 achieves the highest AA-Briefcase Elo, a score that combines rubric pass rate with pairwise Elo ratings for analytical quality and presentation quality. It scores 1587, well ahead of the field. Claude Opus 4.8 (max) and GLM-5.2 (max) follow, with GPT-5.5 (xhigh) in fourth.
