
The team behind the LM Arena leaderboard, the de facto popularity contest for frontier models, just shipped a product that pushes its evaluation work past single-turn chat. Agent Mode turns Arena into a sandboxed agent runtime, and the traces it collects feed a new leaderboard built on causal inference rather than pairwise vote counts.
Agent Mode autonomously builds a plan and uses built-in tools to accomplish a multi-step workflow in one go, like building a website or running deep research, instead of forcing users to chain prompts. The toolset includes web search, image generation, coding and technical assistance, file attachments, and a sandbox bash environment for testing and iteration. Frontier models including GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro are wired up alongside top open-weights models.
From pairwise votes to causal tracing
The interesting part is not the chat wrapper but the evaluation method underneath. Rather than pairwise votes, rankings are calculated using a methodology called causal tracing, which treats the agent as a multi-component system where each component selection represents a possible treatment. By randomizing which orchestrator model, subagent, or harness piece gets used in each session, Arena turns live usage into a multi-intervention randomized controlled trial. Measurements from those sessions can then be aggregated to estimate causal treatment effects, which they call
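
To make the randomized-trial idea concrete, here is a minimal Python sketch. It is not Arena's implementation: the component names, the simulated success rates, and the plain difference-in-means estimator are all assumptions chosen for illustration. Each session gets an independently randomized choice per component slot, and because assignment is random, comparing average outcomes across choices gives an estimate of each choice's effect.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical component slots and candidate choices (illustrative names only).
COMPONENTS = {
    "orchestrator": ["model_a", "model_b"],
    "subagent": ["model_c", "model_d"],
}

# Assumed "true" lifts used only to simulate outcomes; not real measurements.
TRUE_LIFT = {"model_a": 0.0, "model_b": 0.2, "model_c": 0.0, "model_d": 0.1}


def assign(rng):
    """Independently randomize one choice per component slot for a session."""
    return {slot: rng.choice(options) for slot, options in COMPONENTS.items()}


def simulate(n, seed=0):
    """Generate n (config, outcome) pairs with a binary success outcome."""
    rng = random.Random(seed)
    sessions = []
    for _ in range(n):
        config = assign(rng)
        p_success = 0.5 + sum(TRUE_LIFT[choice] for choice in config.values())
        sessions.append((config, 1.0 if rng.random() < p_success else 0.0))
    return sessions


def estimate_effects(sessions, slot):
    """Mean outcome for each choice in a slot, minus the overall mean outcome."""
    by_choice = defaultdict(list)
    for config, outcome in sessions:
        by_choice[config[slot]].append(outcome)
    overall = mean(outcome for _, outcome in sessions)
    return {choice: mean(vals) - overall for choice, vals in by_choice.items()}
```

Calling `estimate_effects(simulate(20000), "orchestrator")` returns one number per orchestrator choice, and the gap between the two should land near the assumed 0.2 lift. A production system would presumably also need confidence intervals and a way to handle interactions between components, which this sketch leaves out.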
