MiniMax's M3 Beats Every Open-Weight Rival at One-Tenth the Price

Artificial Analysis

Jun 08, 2026

2 min read

LLMS

long_context small_models vision_language

BENCHMARKS

Independent evaluation firm Artificial Analysis has published its full benchmark results for MiniMax-M3, scoring it 55 on the Artificial Analysis Intelligence Index. That number matters for one specific reason: once MiniMax releases the model weights (promised within roughly 10 days of the June 1 API launch), M3 will be the highest-scoring open-weights model on the index, edging ahead of Kimi K2.6 and MiMo-V2.5-Pro, which both sit at 54.

The distinction between "API available" and "weights available" is not a footnote here. Until the weights ship and independent engineers can inspect the architecture, verify the training setup, and assess safety behavior, M3's open-weight designation is a company commitment, not a verifiable fact. The benchmark results from Artificial Analysis are independently run, but the model itself is still a black box.

What the numbers actually say

The Artificial Analysis Intelligence Index v4.0 is a composite of 10 evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt. It is run independently, not self-reported by the lab. Claude Opus 4.8 currently leads the index at 61, followed by GPT-5.5 at 60. M3's score of 55 puts it meaningfully behind the closed-source frontier but ahead of every open-weights peer.

The per-benchmark breakdown from Artificial Analysis shows real improvement over M2.7 (which scored 50):

HLE (Humanity's Last Exam): +9 points, from 28% to 37%
GPQA Diamond (scientific reasoning): +6 points, from 87% to 93%
IFBench (instruction following): +7 points, from 76% to 83%
AA-LCR (long-context reasoning): +5 points, from 69% to 74%
SciCode (coding): -2 points, a small regression from 47% to 45%
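
The point changes in the list above can be rechecked with a few lines of Python. This is a minimal sketch that uses only the percentages quoted in this article; the names `scores` and `deltas` are illustrative and are not part of any Artificial Analysis tooling.

```python
# Benchmark scores in percent, as quoted in the list above: (M2.7, M3).
scores = {
    "HLE": (28, 37),
    "GPQA Diamond": (87, 93),
    "IFBench": (76, 83),
    "AA-LCR": (69, 74),
    "SciCode": (47, 45),
}


def deltas(table):
    """Return the M3 minus M2.7 change, in percentage points, per benchmark."""
    return {name: new - old for name, (old, new) in table.items()}


changes = deltas(scores)
for name, change in changes.items():
    # The "+d" format spec prints an explicit sign, so regressions stand out.
    print(f"{name}: {change:+d} points")
```

Four of the five listed benchmarks improve, and SciCode is the only one that moves down.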

On GDPval-AA, a benchmark measuring real-world task performance across 44 occupations and 9 industries, M3 scores approximately 1670. That is behind Claude Opus 4.8 (1890) and GPT-5.5 (1769) but roughly level with Claude Sonnet 4.6 (1676). On practical work tasks, then, M3 is roughly competitive with Sonnet-class closed models.
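
To put those GDPval-AA gaps side by side, the sketch below subtracts M3's score from each closed model's score. It assumes only the figures quoted above; the M3 value is approximate, so the differences are approximate too, and the variable names are illustrative.

```python
# GDPval-AA scores as quoted above. The M3 figure is approximate ("about 1670").
closed_models = {
    "Claude Opus 4.8": 1890,
    "GPT-5.5": 1769,
    "Claude Sonnet 4.6": 1676,
}
m3_score = 1670

# Positive gap means the closed model scores higher than M3.
gaps = {model: score - m3_score for model, score in closed_models.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap:+d} vs M3")
```

The gap to Sonnet 4.6 is a handful of points, against roughly 100 to GPT-5.5 and more than 200 to Opus 4.8.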
