Reka's WorldModelGym Tests If AI World Models Actually Help Agents Win

Reka

8H AGO

2 min read

WorldModelGym is a new benchmark from Reka that asks a deceptively simple question about world models: if an agent uses one to choose between possible actions, does it pick the right one? The question is not "does it generate realistic-looking video?" or "does it reconstruct the input accurately?" but simply: does it actually help the agent win?

The distinction matters more than it might seem. Most existing world model benchmarks measure visual quality, physical plausibility, or reconstruction accuracy. These benchmarks largely treat world models as video generators and do not assess their functional roles in agent decision-making. WorldModelGym fills that gap with a concept Reka calls decision-based fidelity.

The Gap Nobody Was Measuring

A world model, in the RL sense, is a learned simulator: given a state and an action, it predicts the next state and reward. By unrolling these predictions step by step, the model simulates how the world evolves, effectively allowing the agent to "see" the future. The problem is that a model can look great on perceptual benchmarks while still steering an agent toward terrible decisions: its errors might cancel out visually but compound catastrophically when used for planning.
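To make the unrolling idea concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of WorldModelGym: `true_step` stands in for the real environment, `learned_step` for a learned model with a small per-step bias, and the one-dimensional state and reward are toy choices. It shows how an error that is tiny at each step can add up over a rollout.

```python
from typing import Callable, List, Sequence, Tuple

# A step function maps (state, action) -> (next_state, reward).
Step = Callable[[float, float], Tuple[float, float]]


def true_step(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical ground-truth dynamics: move by `action`; reward is higher near 0."""
    next_state = state + action
    return next_state, -abs(next_state)


def learned_step(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical learned model: same dynamics plus a small bias of +0.1 per step."""
    next_state = state + action + 0.1
    return next_state, -abs(next_state)


def unroll(step: Step, state: float, actions: Sequence[float]) -> Tuple[List[float], float]:
    """Apply `step` once per action, feeding each predicted state back in.

    Returns the visited states (including the start) and the summed reward.
    """
    states = [state]
    total_reward = 0.0
    for action in actions:
        state, reward = step(state, action)
        states.append(state)
        total_reward += reward
    return states, total_reward


# Ten "do nothing" actions from the origin.
actions = [0.0] * 10
real_states, real_return = unroll(true_step, 0.0, actions)
pred_states, pred_return = unroll(learned_step, 0.0, actions)

# The per-step error is only 0.1, but the gap grows with the horizon.
drift = [abs(p - r) for p, r in zip(pred_states, real_states)]
print("state error after 1 step:  ", drift[1])
print("state error after 10 steps:", drift[-1])
print("real return vs predicted:  ", real_return, pred_return)
```

In this toy setup the one-step prediction looks nearly perfect, yet the predicted return over ten steps is far from the real one, which is the kind of gap a per-frame perceptual metric would not surface.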

WorldModelGym is designed to catch exactly that failure mode. It evaluates world models indirectly, through the lens of decision-based fidelity: a world model is judged by the consequences, in the real world, of acting on its predictions.

WorldModelGym environments across Atari, Meta-World, DeepMind Control, and classic control tasks

How the Evaluation Works

The protocol is elegant. Rather than running a full agent rollout (expensive and slow), the benchmark presents a world model with a multiple-choice test at a critical decision point. Each question offers the world model five candidate action sequences, one of which is random. Since all five have already been run in the real environment, the benchmark knows each one's true outcome.
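The multiple-choice setup described above can be sketched in a few lines of Python. This is a hedged illustration, not Reka's implementation: the `Question` structure, the rule of picking the choice with the highest predicted return, and the two summary numbers (`accuracy` and `mean_regret`) are all assumptions made for the example, and `toy_score` is a hypothetical stand-in for a world model's predicted return.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

ActionSeq = Sequence[int]
# Assumed interface: a scorer returns the model's predicted return for a choice.
Scorer = Callable[[float, ActionSeq], float]


@dataclass
class Question:
    state: float               # observation at the decision point
    choices: List[ActionSeq]   # candidate action sequences (five in the benchmark)
    true_returns: List[float]  # real-environment outcome of each choice, known in advance


def pick_choice(score: Scorer, question: Question) -> int:
    """Index of the choice the model predicts to be best."""
    predicted = [score(question.state, choice) for choice in question.choices]
    return max(range(len(predicted)), key=predicted.__getitem__)


def evaluate(score: Scorer, questions: List[Question]) -> Dict[str, float]:
    """Judge the model by the real outcome of the choice it picked."""
    correct = 0
    total_regret = 0.0
    for question in questions:
        picked = pick_choice(score, question)
        best = max(question.true_returns)
        achieved = question.true_returns[picked]
        correct += int(achieved == best)
        total_regret += best - achieved
    n = len(questions)
    return {"accuracy": correct / n, "mean_regret": total_regret / n}


def toy_score(state: float, choice: ActionSeq) -> float:
    """Hypothetical model: predicts the return as the sum of the actions."""
    return float(sum(choice))


choices = [[1, 1], [0, 0], [2, 0], [0, 3], [1, 0]]
questions = [
    # The model's favourite (index 3) really is the best choice here.
    Question(state=0.0, choices=choices, true_returns=[2.0, 0.0, 1.0, 5.0, 0.5]),
    # Here the model is misled: index 3 looks best to it, but index 0 wins in reality.
    Question(state=0.0, choices=choices, true_returns=[4.0, 0.0, 1.0, 1.0, 0.5]),
]
result = evaluate(toy_score, questions)
print(result)
```

Because the true outcomes are looked up rather than simulated, scoring a question costs only the model's own predictions, which is the efficiency argument made above for avoiding full agent rollouts.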
