Testing a robot policy in the real world is slow, expensive, and hard to scale. A human has to set up the scene, watch the robot attempt the task, reset everything, and repeat, dozens of times per policy checkpoint. Robot policies have become substantially more capable and general in recent years, but evaluating them remains a persistent challenge, and the problem grows more acute as policies become more generalist and need broader, more diverse evaluation scenarios. SC3-Eval, a joint project from Physical Intelligence and NVIDIA, proposes a different answer: skip the real world entirely and let a video world model do the judging.

The Evaluation Bottleneck Nobody Talks About

The cost of real-world evaluation rises with capability. Robot foundation models are VLAs (vision-language-action models): they take camera feeds and language instructions and output motor commands. As these models become more capable, the number of tasks they need to be tested on grows with them. Physical Intelligence noted that evaluating all the different things their models can do takes longer and longer, and a world model eval that needs only a couple of hours of GPU time is a compelling alternative.

Real-world evaluation is inherently unscalable: it is limited by logistics, safety concerns, and reproducibility issues, and it depends on human operators to set up scenes, supervise trials, reset everything by hand, and score the results. That dependence restricts both the scale and the frequency of evaluations. The idea of using a learned world model as a proxy evaluator has been explored before, but prior work struggled with a core technical problem: errors compound.
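A toy sketch can make the compounding concrete. The Python below is not SC3-Eval's method and does not use any real world model; `true_step`, `model_step`, and the `bias` value are hypothetical stand-ins, chosen only to contrast a model that is fed its own predictions (an open-loop rollout) with one that is corrected by the true state at every step.

```python
def true_step(state: float, action: float) -> float:
    """Hypothetical ground-truth dynamics: the action shifts the state."""
    return state + action


def model_step(state: float, action: float, bias: float = 0.01) -> float:
    """Hypothetical learned model: same dynamics plus a small systematic error."""
    return state + action + bias


def rollout_errors(actions, bias: float = 0.01):
    """Return per-step absolute errors for open-loop and one-step prediction."""
    true_state = 0.0
    imagined_state = 0.0  # the model's own running prediction
    open_loop_errors, one_step_errors = [], []
    for action in actions:
        # One-step: predict from the true current state, so errors cannot pile up.
        one_step_pred = model_step(true_state, action, bias)
        # Open-loop: predict from the model's previous output, so errors carry over.
        imagined_state = model_step(imagined_state, action, bias)
        true_state = true_step(true_state, action)
        one_step_errors.append(abs(one_step_pred - true_state))
        open_loop_errors.append(abs(imagined_state - true_state))
    return open_loop_errors, one_step_errors


if __name__ == "__main__":
    open_loop, one_step = rollout_errors([0.1] * 50)
    print(f"one-step error at step 50:  {one_step[-1]:.3f}")
    print(f"open-loop error at step 50: {open_loop[-1]:.3f}")
```

In this toy setup the one-step error stays at the per-step bias, while the open-loop error grows with every step because each prediction inherits the previous one's mistake. Real video models are far messier, but the same mechanism drives long imagined rollouts away from reality.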

When a video world model generates the next frame from the current frame and the robot's action, small mistakes accumulate. By step 50 of a manipulation task, the imagined scene can look nothing like reality. This is the
