Google DeepMind's D4RT Reconstructs Dynamic 4D Scenes 300x Faster

Google Research

Jun 05, 2026

Google DeepMind has introduced D4RT, short for Dynamic 4D Reconstruction and Tracking, a single transformer that pulls geometry, motion, and camera parameters out of an ordinary video in one pass. In testing, it ran 18x to 300x faster than previous state-of-the-art methods. The work is being presented at CVPR 2026 and the technical report is on arXiv.

The pitch is simple: instead of stitching together a depth estimator, a point tracker, and a pose solver, you get one feedforward model that answers a single, very general question about any pixel at any time from any viewpoint. That reframing is what unlocks both the speed and the accuracy gains.

One question to rule four dimensions

D4RT is a unified encoder-decoder Transformer. The encoder first processes the input video into a compressed representation of the scene's geometry and motion. Unlike older systems that employed separate modules for different tasks, D4RT computes only what is asked of it, through a flexible querying mechanism centered on a single, fundamental question: "Where is a given pixel from the video located in 3D space at an arbitrary time, as viewed from a chosen camera?"

That phrasing matters because it collapses three traditionally separate computer vision tasks (depth estimation, point tracking, and camera pose estimation) into the same interface. A query is parameterized by a source pixel (u, v), a source timestep t_src, a target timestep t_tgt, and a target camera t_cam, and the decoder returns the corresponding 3D position. Each query also carries a local image patch around the pixel for extra spatial context.
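To make the query interface concrete, here is a minimal Python sketch of a query as a small record. The field names follow the notation above; the class name Query, the helper functions, and the specific way each task maps onto query settings are assumptions made for illustration, not details confirmed by the article or taken from any released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Query:
    """One point query. Field names follow the article's notation."""
    u: float    # source pixel column
    v: float    # source pixel row
    t_src: int  # frame the pixel is taken from
    t_tgt: int  # time at which the 3D position is requested
    t_cam: int  # camera (by frame index) the answer is expressed in


def depth_queries(width: int, height: int, t: int, stride: int = 1) -> List[Query]:
    """Assumed mapping: a depth map for frame t asks about every sampled
    pixel with source, target, and camera all set to t."""
    return [Query(u, v, t, t, t)
            for v in range(0, height, stride)
            for u in range(0, width, stride)]


def track_queries(u: float, v: float, t_src: int,
                  num_frames: int, t_cam: int = 0) -> List[Query]:
    """Assumed mapping: a 3D track follows one source pixel across every
    target time, expressed in one fixed camera."""
    return [Query(u, v, t_src, t_tgt, t_cam) for t_tgt in range(num_frames)]
```

Under these assumptions, a sparse track costs num_frames queries while a dense depth map costs width x height queries, which is the sense in which the model only computes what it is asked for.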

Figure: D4RT architecture, with a self-attention encoder and a cross-attention decoder that queries 3D positions from a video.

The key engineering insight is that queries are independent of each other, so they can be processed in parallel on modern AI hardware. This makes D4RT fast and scalable, whether it is tracking just a few points or reconstructing an entire scene. It also sidesteps the dense per-frame decoding that bogs down most prior 4D systems, which must output a full depth map or flow field whether or not every pixel is needed.
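The independence property can be illustrated with a toy decoder in plain Python. This is a sketch under strong assumptions: a single-head, single-layer cross-attention with no learned weights stands in for the real decoder, and the names cross_attend and decode_all are hypothetical. The point is only that each query reads the shared scene features and never another query's result, so any subset can be decoded alone or in any order.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

Vector = Sequence[float]


def cross_attend(query_vec: Vector, keys: Sequence[Vector],
                 values: Sequence[Vector]) -> List[float]:
    """Single-query attention: a softmax(q.k / sqrt(d)) weighted sum of values."""
    scale = 1.0 / math.sqrt(len(query_vec))
    scores = [scale * sum(q * k for q, k in zip(query_vec, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[i] for w, val in zip(weights, values)) / total
            for i in range(dim)]


def decode_all(queries: Sequence[Vector], keys: Sequence[Vector],
               values: Sequence[Vector], workers: int = 4) -> List[List[float]]:
    """Decode every query against the same encoded scene.

    No query depends on another, so the work is a plain map. The thread
    pool only illustrates that independence; a real system would batch
    the queries on an accelerator instead.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: cross_attend(q, keys, values), queries))
```

Because decode_all is a map over independent items, decoding three queries or three million differs only in batch size, and the result for any one query is the same whether it is decoded alone or alongside others.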
