Sakana AI's CoffeeBench Catches Claude Haiku 4.5 Going Bankrupt Over 90 Days

Sakana AI

4H AGO

2 min read

BENCHMARKS

AGENTS

deep_research multi_agent tool_use


Most AI benchmarks are sprints. A model reads a prompt, generates an answer, and gets scored. CoffeeBench, a new benchmark from Sakana AI and KPMG Japan's audit firm AZSA, is something closer to a marathon, and a competitive one. Six LLM agents are dropped into a simulated coffee industry supply chain and told to run their businesses for 90 simulated days. The goal: maximize net profit. The result: some models thrive, one quietly goes bankrupt, and a genuinely new class of failure mode gets a name.

Why existing benchmarks fall short

The problem with most agentic benchmarks is that they test a single agent against a passive environment. Real economic activity doesn't work that way. Economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. Benchmarks like MMLU or HumanEval measure a single response in isolation. Even newer agentic evals rarely go beyond a few dozen steps.

The closest predecessor to CoffeeBench is Vending-Bench, which tested whether LLMs could run a vending machine business solo. LLMs can be impressively proficient at isolated, short-term tasks, but they often fail to maintain coherent performance over longer time horizons. Vending-Bench probed exactly that: an agent manages a long-running business scenario whose tasks are each simple but collectively stress an LLM's capacity for sustained, coherent decision-making. CoffeeBench takes that premise and multiplies it: six companies, six agents, all interacting with each other.
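To make the difference between a one-shot eval and a long-horizon, multi-agent one concrete, here is a minimal Python sketch. It is not CoffeeBench's environment: the toy market, the constants, the `run` function, and the two scripted policies are all hypothetical stand-ins for LLM agents. The only figures taken from the article are six companies, 90 simulated days, and net profit as the score.

```python
# Toy long-horizon, multi-agent loop. Everything here is an illustrative
# assumption except: six companies, 90 days, score = net profit.
DAYS = 90
START_CASH = 500     # assumed starting capital per company
DAILY_DEMAND = 60    # assumed units of demand, split among solvent companies
PRICE = 5            # assumed sale price per unit
UNIT_COST = 3        # assumed purchase cost per unit
OVERHEAD = 10        # assumed fixed daily cost

def steady(day, cash):
    """Hypothetical policy: order a modest, constant quantity."""
    return 10

def overbuyer(day, cash):
    """Hypothetical policy: order far more than the market can absorb."""
    return 100

def run(policies, days=DAYS):
    """Run every policy in one shared market; return (net profits, bankrupt set)."""
    cash = {name: START_CASH for name in policies}
    bankrupt = set()
    for day in range(days):
        active = [name for name in policies if name not in bankrupt]
        if not active:
            break
        # Rivals share demand, so one agent's fate changes the others' market.
        share = DAILY_DEMAND / len(active)
        for name in active:
            qty = policies[name](day, cash[name])
            sold = min(qty, share)
            cash[name] += sold * PRICE - qty * UNIT_COST - OVERHEAD
            if cash[name] < 0:
                bankrupt.add(name)  # a bankrupt company stops acting
    profits = {name: cash[name] - START_CASH for name in policies}
    return profits, bankrupt

# Five steady companies and one overbuyer: six agents in total.
policies = {f"steady_{i}": steady for i in range(1, 6)}
policies["overbuyer"] = overbuyer
profits, bankrupt = run(policies)
```

Under these assumed numbers the overbuying policy runs out of cash within the first few days while the steady ones accumulate profit over all 90. A single-response benchmark has no way to surface that kind of gap, because it only shows up when small decisions compound.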

A supply chain in a sandbox
