Artificial Analysis' AA-Briefcase Benchmark Exposes What AI Models Can't Actually Do

Artificial Analysis

4H AGO

2 min read

BENCHMARKS

AGENTS

deep_research multi_agent tool_use

Most AI benchmarks ask a model to answer a question, solve a math problem, or write a function. AA-Briefcase asks it to act as a consultant for six weeks, synthesize 25,000 Slack messages and 3,500 emails, build a financial model, and present findings to a board. The gap between those two things is exactly the gap the benchmark is designed to expose.

What the benchmark actually tests

Artificial Analysis built AA-Briefcase around four private, multi-week knowledge work scenarios covering data science, product management, banking operations, and heavy industry strategy. Across those four scenarios there are 91 tasks in total, each requiring a realistic professional deliverable: spreadsheets, PowerPoint decks, PDFs, memos. The tasks build week by week within a scenario, sharing institutional context, so a model working on week 4 needs to understand what happened in weeks 1 through 3.

The input data is deliberately messy. Each task draws on hundreds of files, and the full benchmark contains nearly 2,000 source files, including:

25,000+ Slack messages
3,500+ emails
Company documents, meeting transcripts, and large-scale data exports
Files that contradict each other, simulating real organizational noise

Grading has three components. Each submission is first scored against a binary rubric (did the model find the right numbers, cite the right sources, resolve planted contradictions?), then compared pairwise against other models' submissions on two further axes: analytical quality and presentation quality. The combined score is expressed as an Elo rating, the system used in chess rankings, where a higher rating means a model is expected to win more of its head-to-head comparisons against lower-rated models.
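
To make the scoring vocabulary concrete, here is a minimal Python sketch of the two building blocks mentioned above: a rubric pass rate and a standard Elo update from one pairwise comparison. The article does not say how Artificial Analysis fits its Elo ratings or how it combines the three components, so this is not their method. The function names, the K-factor of 32, and the 400-point scale are assumptions taken from conventional chess Elo.

```python
from __future__ import annotations


def rubric_pass_rate(checks: list[bool]) -> float:
    """Fraction of binary rubric checks that passed (0.0 for an empty rubric)."""
    return sum(checks) / len(checks) if checks else 0.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    The 400-point scale is the chess convention, assumed here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (A, B) ratings after one pairwise comparison.

    score_a is 1.0 if A's deliverable was preferred, 0.0 if B's was,
    and 0.5 for a tie. k (assumed 32) controls how far one result moves a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical rubric outcome: three of four checks passed.
    print(rubric_pass_rate([True, False, True, True]))

    # Two equally rated models: each is expected to win half the time.
    print(expected_score(1500, 1500))

    # If A wins that comparison, A gains exactly what B loses.
    print(elo_update(1500, 1500, score_a=1.0))
```

Under this formula a rating gap, not the absolute number, determines the expected outcome: a 400-point lead corresponds to an expected score of about 0.91 in any single comparison.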

The leaderboard, and what it costs

Claude Fable 5 tops the leaderboard with an AA-Briefcase Elo of 1587, well ahead of the field. The headline Elo combines rubric pass rate with the pairwise Elo ratings for analytical quality and presentation quality. Claude Opus 4.8 (max) and GLM-5.2 (max) follow, with GPT-5.5 (xhigh) in fourth.
