Artificial Analysis Rebuilds Intelligence Index v4.1 Around Agentic AI Tasks

Artificial Analysis

1D AGO

2 min read

Benchmarks have a shelf life. Once frontier models start clustering at the top, the scores stop telling you anything useful. Artificial Analysis just released Intelligence Index v4.1, a significant overhaul of its composite AI scoring system, and the headline change is a deliberate pivot: the index now weights agentic tasks more heavily than any other category, reflecting where the real capability race is happening in 2026.

What actually changed

Three core updates define v4.1. First, the benchmark composition itself was restructured. Terminal-Bench Hard was upgraded to Terminal-Bench 2.1, and τ²-Bench Telecom was replaced by τ³-Bench Banking. Both changes bring in newer, more robust task sets with harder, more realistic agentic scenarios that better separate frontier models. The Intelligence Index is now calculated as a weighted average across four categories: Agents (34%), Coding (24%), Scientific Reasoning (24%), and General (18%). Agents carries the largest share, an explicit emphasis on agentic tasks.
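To make the weighting concrete, here is a minimal sketch of a weighted average over the four categories. The weights come from the figures above; the function name, the 0-100 score scale, and the example scores are assumptions for illustration, not Artificial Analysis's actual pipeline or real model results.

```python
# Category weights as stated in the article (they sum to 100%).
WEIGHTS = {
    "Agents": 0.34,
    "Coding": 0.24,
    "Scientific Reasoning": 0.24,
    "General": 0.18,
}


def intelligence_index(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (assumed to be on a 0-100 scale)."""
    missing = WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# Hypothetical category scores, for illustration only.
example = {
    "Agents": 50.0,
    "Coding": 60.0,
    "Scientific Reasoning": 70.0,
    "General": 80.0,
}
print(round(intelligence_index(example), 2))
```

Under this weighting, a one-point gain in the Agents category moves the composite by 0.34 points, almost twice the 0.18 points from the same gain in General.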

Second, GDPval-AA, the highest-weighted single evaluation in the index at 20%, got a meaningful upgrade. GDPval-AA v2 re-baselines Elo scores so that human expert performance sits at 1000, replaces the single judge with a panel of three frontier LLM judges from leading labs, and raises the turn limit to 250 to allow for even longer-horizon agent trajectories. The benchmark tests models on real knowledge-work deliverables across 44 occupations, graded by blind pairwise comparison.
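The mechanics behind those changes can be sketched as follows. The article does not say how GDPval-AA v2 fits its ratings or combines the three judges, so this assumes the standard Elo logistic formula with a 400-point scale, a constant shift for re-baselining, and simple majority voting across the panel. All function names and ratings here are hypothetical.

```python
from collections import Counter

HUMAN_BASELINE_ELO = 1000.0  # from the article: human expert performance is pinned at 1000


def rebaseline(ratings: dict[str, float], anchor: str) -> dict[str, float]:
    """Shift every rating by one constant so that `anchor` lands on 1000.

    A constant shift keeps all rating differences, and therefore all
    predicted win rates, unchanged.
    """
    offset = HUMAN_BASELINE_ELO - ratings[anchor]
    return {name: rating + offset for name, rating in ratings.items()}


def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation that A beats B (the 400 scale is an assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def panel_verdict(votes: list[str]) -> str:
    """Majority vote over blind pairwise judgments (assumed aggregation rule)."""
    winner, count = Counter(votes).most_common(1)[0]
    if count * 2 <= len(votes):
        raise ValueError("no strict majority among judges")
    return winner


# Hypothetical raw ratings, for illustration only.
raw = {"human_expert": 1200.0, "model_x": 1300.0}
scaled = rebaseline(raw, anchor="human_expert")
print(scaled)
print(panel_verdict(["A", "B", "A"]))
```

With human experts fixed at 1000, a rating above 1000 reads directly as "preferred over the expert more often than not" in pairwise comparisons, which is what makes the re-baselined scale easier to interpret.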

Third, IFBench was dropped from the index entirely. This is the shelf-life problem in practice: once models score near the top, an eval stops being useful for telling systems apart, and most evaluations added to the Intelligence Index saturate within about six months. IFBench, which tested precise instruction-following on novel constraints, has now hit that wall with frontier models. Artificial Analysis says it will continue running and publishing IFBench results, just not counting them in the composite score.
