Ai2's MolmoMotion Predicts 3D Object Motion Before It Happens, Beating Video AI

Ai2

11H AGO

2 min read

ROBOTICS

manipulation vla_models

OPEN_SOURCE

AI models have gotten remarkably good at tracking how things move in video. But tracking is retrospective: it tells you where something went, not where it's going. MolmoMotion, a new open model from the Allen Institute for AI (Ai2), flips that around: given a video frame, a set of 3D points on an object, and a plain-language instruction like "Put the white bowl on the table," it predicts where those points will travel over the next few seconds in real-world 3D space.

The release is a full stack: the model weights, a 1.16-million-video training dataset called MolmoMotion-1M, and a new evaluation benchmark called PointMotionBench. Everything is openly available on Hugging Face.

The problem nobody had solved cleanly

Motion forecasting, meaning anticipating how objects will move before they move, is surprisingly still an open problem. Prior approaches fell into one of three buckets, each with a fatal flaw:

Pixel-space video generators (like Wan2.2 or Cosmos Predict) generate plausible-looking future frames, but visually plausible video doesn't mean the predicted motion is metrically accurate. They spend enormous compute rendering appearance when you only need geometry.
Parametric 3D models (pose estimators for human bodies, hands, or rigid objects) are accurate but only work for specific object categories they were designed for.
2D point trajectory methods are category-agnostic, but 2D image-plane coordinates mix object motion with camera movement, making them hard to use downstream.
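
To make "metrically accurate" concrete, here is a minimal sketch of average displacement error, a common trajectory-forecasting metric: the mean Euclidean distance between predicted and true 3D positions over time. The function name and the sample trajectories are illustrative assumptions; the article does not say which metrics PointMotionBench actually uses.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[Point3D]


def average_displacement_error(pred: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean distance (in the trajectories' units) per time step.

    Illustrative metric only; not confirmed as the benchmark's own.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(pred)


# Made-up example data: one tracked point over two future time steps, in metres.
truth = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
pred = [(3.0, 4.0, 1.0), (0.0, 0.0, 2.0)]
error = average_displacement_error(pred, truth)
```

A generated video can look convincing while scoring badly on a measure like this, because nothing in the rendered pixels forces the implied motion to match real-world distances.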

MolmoMotion's answer is to represent motion as object-attached 3D points in a shared world coordinate frame. This representation is class-agnostic (works on any object), view-stable (the same motion looks the same regardless of camera angle), and compact enough to pass directly into downstream systems like robot planners or video generators.
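
The difference between image-plane and world-frame coordinates can be illustrated with a toy pinhole camera. This is a sketch under simplifying assumptions (a camera with no rotation, looking down the +z axis, with a made-up focal length), and every name in it is hypothetical rather than part of MolmoMotion.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point_world: Point3D, cam_pos: Point3D, focal: float = 500.0) -> Pixel:
    """Pinhole projection for an unrotated camera at cam_pos looking down +z."""
    x = point_world[0] - cam_pos[0]
    y = point_world[1] - cam_pos[1]
    z = point_world[2] - cam_pos[2]
    return (focal * x / z, focal * y / z)


# A point on a stationary object, in world coordinates (metres).
point_t0: Point3D = (0.5, 0.0, 2.0)
point_t1: Point3D = (0.5, 0.0, 2.0)  # the object did not move

# The camera slides 0.25 m to the right between the two frames.
cam_t0: Point3D = (0.0, 0.0, 0.0)
cam_t1: Point3D = (0.25, 0.0, 0.0)

# World-frame motion: zero, regardless of what the camera did.
world_shift = tuple(b - a for a, b in zip(point_t0, point_t1))

# Image-plane motion: non-zero, caused entirely by the camera.
uv_t0 = project(point_t0, cam_t0)
uv_t1 = project(point_t1, cam_t1)
pixel_shift = (uv_t1[0] - uv_t0[0], uv_t1[1] - uv_t0[1])
```

Here the 2D track reports movement for an object that never moved, which is the ambiguity a shared world frame removes.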

Two models in one

MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image. The team trained two complementary variants:

MolmoMotion-AR (autoregressive): Predicts future coordinates step by step, representing 3D coordinates as structured text and writing out the future trajectory in temporal order. Because each new coordinate is conditioned on the trajectory already generated, this encourages smooth rollouts and gives the strongest accuracy when the future path is well-defined.
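
The autoregressive idea can be sketched in a few lines. The text format below (`t=... x=... y=... z=...`) and the constant-velocity predictor are assumptions made up for illustration: the article does not specify MolmoMotion-AR's actual serialization, and the real model is a learned network, not a hand-written rule. What the sketch does show is the loop structure, where each new coordinate is computed from the trajectory generated so far and written out in temporal order.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def format_step(t: int, p: Point3D) -> str:
    """Serialize one time step as text (hypothetical format)."""
    return f"t={t} x={p[0]:.2f} y={p[1]:.2f} z={p[2]:.2f}"


def parse_step(line: str) -> Tuple[int, Point3D]:
    """Inverse of format_step."""
    fields = dict(token.split("=") for token in line.split())
    return int(fields["t"]), (float(fields["x"]), float(fields["y"]), float(fields["z"]))


def toy_next_point(history: List[Point3D]) -> Point3D:
    """Stand-in for the learned model: constant-velocity extrapolation."""
    if len(history) < 2:
        return history[-1]
    prev, last = history[-2], history[-1]
    return (2 * last[0] - prev[0], 2 * last[1] - prev[1], 2 * last[2] - prev[2])


def rollout(observed: List[Point3D], steps: int) -> List[str]:
    """Generate future steps one at a time, each conditioned on all earlier ones."""
    history = list(observed)
    lines: List[str] = []
    for _ in range(steps):
        nxt = toy_next_point(history)
        history.append(nxt)  # the new prediction becomes context for the next step
        lines.append(format_step(len(history) - 1, nxt))
    return lines


future = rollout([(0.0, 0.0, 0.0), (0.5, 0.0, 0.25)], steps=2)
```

Because every step feeds on the ones before it, the rollout stays smooth by construction, which matches the article's description of why this variant does well when the future path is well-defined.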
