
Spatial reasoning -- figuring out where things are, how they relate in 3D space, and how they move -- is one of the hardest unsolved problems for vision-language models (VLMs). Most agents that try to tackle it hit the same wall: they call a fixed menu of tools in a rigid order, and the moment their initial plan is wrong, there's no way to recover. SpatialClaw, a new training-free framework from NVIDIA Research, proposes a different answer: give the agent a live Python kernel and let it write code.
The bottleneck was never the tools
Tool-augmented agents try to improve VLM spatial reasoning by adding specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. SpatialClaw's core argument is that the interface itself, not the tools, is what limits performance.
To understand why, consider the two dominant approaches before SpatialClaw:
- Single-pass code: The agent writes a complete Python program in one shot, before seeing any intermediate results. If the initial assumption is wrong, the error propagates all the way to the final answer.
- Structured tool-call: The agent dispatches tools through a fixed JSON API. It can call tools, but it can't freely chain their outputs, apply NumPy operations on the results, or branch based on what it sees.
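To make the structured tool-call limitation concrete, here is a minimal sketch of a fixed JSON dispatcher. Everything in it is hypothetical: the tool names, the argument shapes, and the stubbed return values are illustrative assumptions, not the API of any real agent framework. It only shows that a request outside the registered menu cannot be expressed.

```python
import json

# Hypothetical tool registry with stubbed results (illustrative only).
TOOLS = {
    "segment": lambda args: {"mask_id": 7, "label": args["label"]},
    "depth": lambda args: {"meters": 2.0},
}

def dispatch(call_json):
    """Run one JSON tool call and return a JSON string result."""
    call = json.loads(call_json)
    name = call["tool"]
    if name not in TOOLS:
        # Anything not on the fixed menu is rejected outright.
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(TOOLS[name](call.get("arguments", {})))

# A registered tool works as expected.
ok = dispatch('{"tool": "depth", "arguments": {"mask_id": 7}}')

# An ad hoc array operation has no slot in the schema.
rejected = dispatch('{"tool": "numpy.median", "arguments": {"values": [1, 2, 3]}}')
```

In this toy setup, every intermediate result also crosses the boundary as serialized JSON, so combining two tool outputs would need either a new registered tool or another round trip through the model.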
SpatialClaw treats code as the action interface: a VLM-backed agent writes one Python cell per step into a persistent Jupyter kernel pre-loaded with perception primitives (SAM3 segmentation, Depth-Anything-3 reconstruction, geometry utilities) and scientific libraries (NumPy, SciPy, Matplotlib). Each cell can compose tool outputs, inspect intermediate evidence, and revise the analysis before committing an answer with
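The one-cell-per-step idea can be illustrated with a toy stand-in for a persistent kernel: a single namespace that survives across cells, so later cells can reuse and reinterpret earlier results. This is a sketch under stated assumptions, not SpatialClaw's implementation. `PersistentKernel`, `estimate_depth`, and the depth values are hypothetical, and the cells are hard-coded here, whereas an agent would write each one after reading the previous observation.

```python
import contextlib
import io

class PersistentKernel:
    """Toy stand-in for a persistent kernel: one namespace shared by all cells."""

    def __init__(self, preloaded=None):
        # Pre-loaded primitives live in the same namespace the cells run in.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()

def estimate_depth(obj_name):
    # Hypothetical perception primitive with made-up depths in meters.
    return {"chair": 2.0, "table": 3.5}[obj_name]

kernel = PersistentKernel({"estimate_depth": estimate_depth})

# Hard-coded cells standing in for what an agent would write step by step.
cells = [
    "chair_z = estimate_depth('chair')\nprint(chair_z)",
    "table_z = estimate_depth('table')\nprint(table_z)",
    "closer = 'chair' if chair_z < table_z else 'table'\nprint(closer)",
]

# Each observation is the printed output the agent would read before the next step.
observations = [kernel.run_cell(cell) for cell in cells]
```

The third cell depends on variables created by the first two, which is only possible because state persists between steps. If an observation had looked wrong, the next cell could have recomputed it instead of carrying the error forward.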

