LM Arena Ships Agent Mode to Rank AI Models on Real Multi-Step Tasks

Arena.ai

Jun 04, 2026

2 min read

AGENTS

agent_frameworks deep_research tool_use

BENCHMARKS

The team behind the LM Arena leaderboard, the de facto popularity contest for frontier models, just shipped a product that pushes its evaluation work past single-turn chat. Agent Mode turns Arena into a sandboxed agent runtime, and the traces it collects feed a new leaderboard built on causal inference rather than pairwise vote counts.

Agent Mode autonomously builds a plan and uses built-in tools to accomplish a multi-step workflow in one go, like building a website or running deep research, instead of forcing users to chain prompts. The toolset includes web search, image generation, coding and technical assistance, file attachments, and a sandbox bash environment for testing and iteration. Frontier models including GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro are wired up alongside top open-weights models.

Agent Mode prompt UI with attached files and multi-step deliverables

From pairwise votes to causal tracing

The interesting part is not the chat wrapper; it is the evaluation method underneath. Rather than pairwise votes, rankings are calculated using a methodology called causal tracing, which treats the agent as a multi-component system in which each component selection is a possible treatment. By randomizing which orchestrator model, subagent, or harness piece gets used in each session, Arena turns live usage into a multi-intervention randomized controlled trial. Measurements from those sessions can be aggregated to estimate causal treatment effects, which they call
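
To make the randomization idea concrete, here is a minimal sketch of a multi-intervention randomized trial over agent components. It is not Arena's implementation: the slot names, the candidate labels (`model_a` through `model_d`), the simulated success rates, and the simple difference-from-overall-mean estimator are all assumptions chosen for illustration. Because each slot is assigned independently at random, the average outcome for a given choice can be compared against the overall average without the other slots confounding the comparison.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical component slots and candidates (not Arena's real lineup).
SLOTS = {
    "orchestrator": ["model_a", "model_b"],
    "subagent": ["model_c", "model_d"],
}

# Hypothetical "true" contribution of each choice to the session success
# probability. Used only to simulate outcomes; a real system never sees this.
TRUE_EFFECT = {
    ("orchestrator", "model_a"): 0.10,
    ("orchestrator", "model_b"): 0.30,
    ("subagent", "model_c"): 0.05,
    ("subagent", "model_d"): 0.15,
}
BASE_RATE = 0.30


def run_session(rng):
    """Randomly assign one choice per slot, then simulate a binary outcome."""
    config = {slot: rng.choice(choices) for slot, choices in SLOTS.items()}
    p_success = BASE_RATE + sum(TRUE_EFFECT[(s, c)] for s, c in config.items())
    outcome = 1 if rng.random() < p_success else 0
    return config, outcome


def estimate_effects(sessions):
    """Mean outcome per (slot, choice), minus the overall mean outcome."""
    overall = mean(outcome for _, outcome in sessions)
    buckets = defaultdict(list)
    for config, outcome in sessions:
        for slot, choice in config.items():
            buckets[(slot, choice)].append(outcome)
    return {key: mean(vals) - overall for key, vals in buckets.items()}


rng = random.Random(0)  # fixed seed so the simulation is repeatable
sessions = [run_session(rng) for _ in range(20000)]
effects = estimate_effects(sessions)

for key in sorted(effects):
    print(key, round(effects[key], 3))
```

In this toy setup, the gap between two choices in the same slot approximates the difference in their simulated contributions. A production system would also need confidence intervals, handling of interactions between slots, and an outcome signal richer than a single success bit.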
