Goodfire's Silico Catches Hidden AI Training Bugs Before a Single Step Runs

Goodfire

3H AGO

2 min read

Before you run a single training step, your preference dataset has already decided what your model will learn. Sycophancy, broken guardrails, hallucinated links: it's all in there, quietly waiting. Goodfire just built a way to read it first.

The San Francisco-based interpretability lab has released predictive data debugging, a technique that tells you exactly which behaviors a DPO (Direct Preference Optimization) fine-tuning run will amplify or suppress, before you ever launch training. The method is backed by a 73-page paper and is being built into Silico, Goodfire's platform for model design.

The problem with preference data

Post-training is where most of a model's behavior gets shaped. But the process compresses a rich, messy set of goals into a single scalar reward signal: a thumbs up or thumbs down on each response. That abstraction gives practitioners little visibility into what their data actually teaches models. The standard workflow is: collect preference data, run DPO, eval the result, and then try to reverse-engineer what went wrong from a handful of aggregate scores.

When an eval regresses, you're left guessing which of your 260,000 preference pairs did it. And the worst part? Some behaviors are so specific and unexpected that you'd never think to write an eval for them in the first place.
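
To see how little of the "why" survives, it helps to look at what the training objective actually receives. The sketch below is the standard per-pair DPO loss as it is usually written, not Goodfire's code. The function name, the default beta, and the log-probability values are illustrative assumptions. The only inputs are log-probabilities for the chosen and rejected responses, so nothing in the loss records the reason one response was preferred.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors "chosen" over "rejected" than the
    # reference model does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up log-probabilities, purely for illustration.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy == reference
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)  # policy favors "chosen" more
print(loss_at_start, loss_after_shift)
```

The loss falls whenever the chosen response becomes relatively more likely, whatever trait made it "chosen" in the first place. That is the blind spot the article describes.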

Interpreting the model to interpret the data

The core insight is elegant: if you can interpret a model's internal representations, you can use that model as a lens to interpret your dataset. Goodfire's approach uses Sparse Autoencoders (SAEs), a technique that decomposes a model's internal activations into a large dictionary of human-readable "features" or concepts, each corresponding to something the model has learned to recognize. SAEs express model activations as a sparse linear combination of interpretable feature vectors, making it possible to see what concepts a model is computing for any given input.
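
A toy version makes the "sparse linear combination" idea concrete. The sketch below is a minimal, hand-built example and not Goodfire's implementation. The feature labels, weights, biases, the tied encoder and decoder, and the 3-dimensional activation are all assumptions chosen for readability. Real SAEs are trained, not hand-written, and use dictionaries with many thousands of features.

```python
# Hypothetical feature labels; a real SAE dictionary is learned and far larger.
FEATURE_LABELS = ["flattery", "refusal", "url-like text", "hedging"]

# One decoder vector per feature, living in the model's activation space.
DECODER = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
ENCODER = DECODER            # tied weights, an assumption made for brevity
ENCODER_BIAS = [-0.5] * 4    # negative bias pushes weak features to exactly zero
DECODER_BIAS = [0.0, 0.0, 0.0]


def sae_encode(activation):
    """Feature strengths: ReLU(encoder_row . activation + bias) per feature."""
    features = []
    for row, bias in zip(ENCODER, ENCODER_BIAS):
        pre = sum(w * a for w, a in zip(row, activation)) + bias
        features.append(pre if pre > 0 else 0.0)
    return features


def sae_decode(features):
    """Reconstruction: decoder bias plus each active feature times its vector."""
    out = list(DECODER_BIAS)
    for strength, vector in zip(features, DECODER):
        if strength == 0.0:
            continue  # inactive features contribute nothing
        for j, component in enumerate(vector):
            out[j] += strength * component
    return out


activation = [2.0, 0.0, 0.25]  # stand-in for one token's internal activation
features = sae_encode(activation)
reconstruction = sae_decode(features)
active = [(FEATURE_LABELS[i], f) for i, f in enumerate(features) if f > 0]
print(active)
print(reconstruction)
```

Only two of the four features fire for this input, and the reconstruction is an approximation of the original activation, not an exact copy. Reading off which labeled features are active is what lets the model act as a lens on the data.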
