When you download an open-source LLM, you're not just getting a model trained on text. You're getting the downstream product of dozens of other models, hundreds of datasets, and a web of upstream decisions that nobody has ever mapped end-to-end. ModSleuth, a new agentic system from the Allen Institute for AI (Ai2), is the first tool designed to automatically reconstruct that entire family tree from public artifacts alone.

The dependency problem nobody was counting

Modern LLMs don't just learn from human-written text anymore. They rely on other models to generate synthetic training data, filter and clean corpora, judge output quality, run OCR on documents, and guide reinforcement learning. Each of those upstream models has its own dependencies. The result is a recursive, multi-layered supply chain that no human team can trace manually.

The numbers make this concrete. Applying ModSleuth to four public-artifact-rich LLM releases, the researchers recovered 1,060 source-verified dependencies and constructed large-scale dependency graphs of modern LLM development. Breaking that down by model: OLMo 3 alone has 89 model dependencies and 183 dataset dependencies. Nemotron 3 Super has 273 model dependencies and 560 dataset dependencies. Some of those chains go 8 hops deep.

How ModSleuth actually works

ModSleuth is an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. That means it doesn't just read a model card and call it done. It reads papers, model cards, dataset cards, code configs, and upstream release artifacts, then follows every reference it finds, recursively, until the graph is complete.

Alpha Signal

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

  • Full access to in-depth AI research breakdowns
  • Be the first to know what's trending before it hits mainstream
  • Daily curated papers, repos, and industry moves