Artificial Analysis' AA-Briefcase Reveals Top AI Agents Fail 97% of Real Work Tasks

Artificial Analysis

19H AGO

2 min read

Most AI benchmarks measure what a model knows. AA-Briefcase measures what a model can actually do, and the results are a reality check for anyone deploying agents on real knowledge work. Artificial Analysis just released this proprietary benchmark, and it is unlike anything in the current evaluation landscape.

Not your typical benchmark

The core idea is simple but demanding: put an AI agent inside a realistic multi-week professional project and see if it can produce actual deliverables. AA-Briefcase is a private evaluation of frontier agentic capability in long-horizon knowledge work. It tests agents on realistic business workflows that require deliverables such as spreadsheets, presentations, and memos. The benchmark covers four multi-week knowledge work projects comprising thousands of input files and 91 tasks in total.

The four project scenarios span:

Data Science: transaction data cleaning, forecast modeling, schema design
Product Management: competitive teardowns, PRD writing, go-to-market planning
Banking Operations: branch network transformation, financial modeling, mortgage journey mapping
Heavy Industry Strategy: commodity supply-demand models, M&A valuation benchmarking

Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups. The source material is fragmented, messy, and often contains realistic contradictions, testing whether models can navigate the ambiguity of real-world knowledge work. In total, the benchmark contains nearly 2,000 source files, including 25,000+ Slack messages and 3,500+ emails.

How it scores

Scoring is a three-part composite. Each task is graded on:

Rubric pass rate: binary pass/fail checks for ground-truth correctness (did the model find the right files and reach the right conclusions?)
Analytical quality Elo: pairwise comparison of which model's output is more rigorous and well-supported
Presentation Elo: pairwise comparison of which output is more professionally presented
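
To make the three components concrete, here is a minimal sketch of how a rubric pass rate and a pairwise Elo rating can be computed. It is an illustration only: Artificial Analysis has not published its implementation in this article, so the function names, the starting rating of 1000, and the K-factor of 32 are all assumptions, and the textbook Elo update shown here may differ from whatever rating fit the benchmark actually uses.

```python
def rubric_pass_rate(checks):
    """Fraction of binary rubric checks that passed (True = pass)."""
    if not checks:
        raise ValueError("need at least one rubric check")
    return sum(checks) / len(checks)


def expected_score(rating_a, rating_b):
    """Textbook Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Update both ratings after one pairwise comparison.

    k=32 is an assumed K-factor, not a value from AA-Briefcase.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical task: 3 of 4 rubric checks pass.
print(rubric_pass_rate([True, False, True, True]))  # 0.75

# Two hypothetical agents start level; a judge prefers agent A's output.
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

The same update would run separately for analytical quality and for presentation, giving two independent Elo scores per model alongside the rubric pass rate. How the three are weighted into a single composite is not stated here, so the sketch leaves them uncombined.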
