
Most AI benchmarks measure what a model knows. AA-Briefcase measures what a model can actually do, and the results are a reality check for anyone deploying agents on real knowledge work. Artificial Analysis just released this proprietary benchmark, and it differs from most existing evaluations by testing multi-week projects rather than isolated questions.
Not your typical benchmark
The core idea is simple but demanding: put an AI agent inside a realistic multi-week professional project and see whether it can produce actual deliverables. AA-Briefcase is a private evaluation of frontier agentic capability in long-horizon knowledge work. It tests agents on realistic business workflows that require deliverables such as spreadsheets, presentations, and memos, across four multi-week knowledge work projects comprising thousands of input files and 91 tasks in total.
The four project scenarios span:
- Data Science: transaction data cleaning, forecast modeling, schema design
- Product Management: competitive teardowns, PRD writing, go-to-market planning
- Banking Operations: branch network transformation, financial modeling, mortgage journey mapping
- Heavy Industry Strategy: commodity supply-demand models, M&A valuation benchmarking
Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups. The source material is fragmented, messy, and often contains realistic contradictions, testing whether models can navigate the ambiguity of real-world knowledge work. In total, the benchmark contains nearly 2,000 source files, including 25,000+ Slack messages and 3,500+ emails.
How it scores
Scoring is a three-part composite. Each task is graded on:
- Rubric pass rate: binary pass/fail checks for ground-truth correctness (did the model find the right files and reach the right conclusions?)
- Analytical quality Elo: pairwise comparison of which model's output is more rigorous and well-supported
- Presentation Elo: pairwise comparison of which output is more professionally presented
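To make the two kinds of measurement concrete, here is a minimal sketch of a rubric pass rate and a pairwise Elo update. The article does not say how Artificial Analysis computes its Elo ratings or how the three parts are combined, so everything here is an assumption for illustration: the function names are hypothetical, the Elo formula is the standard chess-style one, and the starting rating of 1000 and K-factor of 32 are arbitrary choices.

```python
def rubric_pass_rate(checks: list[bool]) -> float:
    """Fraction of binary rubric checks that passed (0.0 if there are none)."""
    return sum(checks) / len(checks) if checks else 0.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (assumed, not from the source)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict[str, float], winner: str, loser: str,
               k: float = 32.0) -> None:
    """Apply one pairwise result in place. K=32 is an illustrative assumption."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Hypothetical example: four rubric checks on one task, three of which pass.
pass_rate = rubric_pass_rate([True, True, False, True])

# Hypothetical example: a judge prefers model_a's deliverable over model_b's.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, winner="model_a", loser="model_b")
```

The rubric score is absolute, because each check passes or fails on its own. The Elo scores are relative, because a rating only has meaning against the other models in the comparison pool. That difference is one plausible reason to report the three parts separately rather than as a single number.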

