IBM's ITBench-AA Shows Every Frontier AI Agent Failing Over Half of IT Tasks

Artificial Analysis

May 27, 2026

Every major AI lab now ships some version of an autonomous SRE agent, and so does every major observability vendor: Datadog, AWS, PagerDuty, and incident.io all report 40-70% MTTR reduction in their own testing. The problem is that those numbers come from the vendors selling you the agent. Now there is a controlled, independent answer. Artificial Analysis and IBM have launched the first independent benchmark of agentic enterprise IT tasks, testing every major frontier model on 59 Kubernetes incident scenarios, and every model failed more than half of them.

What ITBench-AA actually is

Artificial Analysis and IBM Software Innovation Lab are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%. The underlying ITBench dataset has been developed by IBM, leveraging deep expertise in enterprise IT operations. Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time.

SRE (Site Reliability Engineering) is the discipline responsible for keeping production systems alive. When something breaks at 2 AM, an SRE investigates alerts, reads logs, traces service dependencies, and identifies the root cause. ITBench-AA asks AI agents to do exactly that. In each scenario, the agent is given an offline snapshot of a Kubernetes incident containing alerts, events, traces, metrics, and application topology, and must produce a structured JSON diagnosis identifying the root cause entities responsible for the failure.
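
To make the task format concrete, here is a minimal Python sketch of what a structured diagnosis and a simple check against ground truth might look like. The field names (root_cause_entities, kind, namespace, name), the sample incident, and the precision/recall comparison are all illustrative assumptions; the article does not specify ITBench-AA's actual output schema or scoring method.

```python
import json

# Hypothetical agent output. Field names are assumptions for illustration,
# not the real ITBench-AA schema.
AGENT_OUTPUT = """
{
  "root_cause_entities": [
    {"kind": "Deployment", "namespace": "shop", "name": "checkout"},
    {"kind": "ResourceQuota", "namespace": "shop", "name": "compute-quota"}
  ]
}
"""

# Hypothetical ground truth for the same made-up incident.
GROUND_TRUTH = {("ResourceQuota", "shop", "compute-quota")}


def parse_diagnosis(raw: str) -> set[tuple[str, str, str]]:
    """Parse the JSON diagnosis into a set of (kind, namespace, name) keys."""
    doc = json.loads(raw)
    return {
        (entity["kind"], entity["namespace"], entity["name"])
        for entity in doc["root_cause_entities"]
    }


def precision_recall(predicted: set, expected: set) -> tuple[float, float]:
    """Compare predicted entities with expected ones (an assumed metric)."""
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


predicted = parse_diagnosis(AGENT_OUTPUT)
precision, recall = precision_recall(predicted, GROUND_TRUTH)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the agent names the true culprit but also blames an innocent Deployment, so a set-based comparison rewards the hit while penalizing the extra guess.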

The benchmark evaluates 59 Kubernetes incident tasks: 40 from IBM's public release and 19 private tasks shared by the ITBench team, with 3 repeats per task. Faults span realistic failure modes: resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions.
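
The arithmetic behind those numbers can be sketched as follows. The task counts and repeat count come from the article; the aggregation (average the repeats within each task, then average across tasks) and the sample scores are assumptions for illustration, since the article does not say how repeats are combined.

```python
from statistics import mean

PUBLIC_TASKS = 40   # from IBM's public release
PRIVATE_TASKS = 19  # shared privately by the ITBench team
REPEATS = 3         # repeats per task

total_tasks = PUBLIC_TASKS + PRIVATE_TASKS
total_runs = total_tasks * REPEATS
print(f"{total_tasks} tasks x {REPEATS} repeats = {total_runs} runs per model")


def benchmark_score(results: dict[str, list[float]]) -> float:
    """Average each task's repeats, then average across tasks.

    This aggregation is an assumption, not the documented ITBench-AA method.
    """
    return mean(mean(runs) for runs in results.values())


# Made-up per-run scores for three hypothetical tasks (1.0 = correct diagnosis).
sample_results = {
    "task-a": [1.0, 0.0, 1.0],
    "task-b": [0.0, 0.0, 0.0],
    "task-c": [1.0, 1.0, 0.0],
}
score = benchmark_score(sample_results)
print(f"sample score: {score:.1%}")
```

Averaging within a task first keeps one flaky task from dominating the overall score, which is one plausible reason to run each scenario more than once.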
