
World Labs, the spatial intelligence company co-founded by Fei-Fei Li, has published three research papers in a single drop, each tackling a different angle of the same hard problem: how do you get rich, complete 3D geometry out of 2D inputs? The trio covers depth prediction from a single image, dynamic 4D reconstruction from monocular video, and a unified model for joint text-image-depth reasoning. Together, they sketch out a research agenda for building world models that actually understand space.
Why this is hard
Standard depth estimation gives you one depth value per pixel -- a flat sheet draped over the visible surface of a scene. That works fine for what you can see, but the moment something is occluded (hidden behind another object), you get nothing. Reconstructing dynamic scenes from a single moving camera is even harder: you have no second viewpoint to triangulate from, no known camera parameters, and the scene itself is changing. The field has made steady progress, but most methods still require multi-view setups, calibrated cameras, or static scenes to work reliably.
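To make the "flat sheet" limitation concrete, here is a minimal toy sketch. It is not taken from any of the papers: the scene, the `ray_hits` data, and both helper functions are hypothetical and made up for illustration. Each pixel's camera ray may cross several surfaces, but a conventional depth map keeps only the nearest crossing, so everything behind it is dropped.

```python
# Toy illustration (hypothetical data, not from the papers):
# each pixel's camera ray may cross several surfaces, but a
# conventional depth map stores only the nearest one.

# For each of 4 pixels, the distances (in metres) at which its ray hits a surface.
ray_hits = [
    [2.0, 5.0],        # front of a box at 2 m, wall behind it at 5 m
    [2.0, 2.5, 5.0],   # front of box, back of box, then the wall
    [5.0],             # wall only
    [],                # ray leaves the scene without hitting anything
]

def single_layer_depth(hits_per_pixel):
    """Return the nearest hit per pixel, or None when the ray hits nothing."""
    return [min(hits) if hits else None for hits in hits_per_pixel]

def hidden_surface_count(hits_per_pixel):
    """Count surface crossings that a single-layer depth map cannot represent."""
    return sum(max(len(hits) - 1, 0) for hits in hits_per_pixel)

depth_map = single_layer_depth(ray_hits)
print(depth_map)                       # [2.0, 2.0, 5.0, None]
print(hidden_surface_count(ray_hits))  # 3
```

In this toy scene, three of the six surface crossings never make it into the depth map. Recovering geometry like that is the gap occlusion-aware methods are trying to close.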
World Labs is betting that large-scale generative models -- the same diffusion and video models powering image and video synthesis -- carry enough implicit 3D knowledge to close that gap. All three papers are built on that thesis.
Paper 1: World Tracing -- depth as a stack, not a sheet
The first paper introduces World Tracing, a method that predicts full 3D geometry from a single image, including surfaces that are completely hidden from view. The key idea is to stop thinking about depth as a single value per pixel and instead predict a
