Goodfire's Silico Erases a Language From AI Using Just One Parameter

Goodfire

7H AGO

2 min read

POST_TRAINING

alignment fine_tuning lora

LLMS

Fine-tuning is usually a blunt instrument. You adjust thousands or millions of parameters, cross your fingers that the model improves on your target task, and hope you haven't quietly broken something else. Goodfire just demonstrated a sharper approach: they removed a 67M-parameter language model's ability to speak German by tuning exactly one scalar value, and the rest of the model barely noticed.

The idea behind parameter decomposition

To understand what's happening here, you need to know about parameter decomposition, Goodfire's method for breaking a model's weight matrices into interpretable rank-1 subcomponents (think of them as atomic building blocks of the model's learned behavior). The method, adVersarial Parameter Decomposition (VPD), searches for decompositions of a network's parameters into simple subcomponents that preserve the network's input-output behavior even when many subcomponents are ablated, including under ablations that are adversarially selected to destroy that behavior.

Each subcomponent is a rank-1 matrix, the simplest kind of nonzero matrix, with one "read" direction (what it looks for in the input) and one "write" direction (what it adds to the output). Training against these ablations encourages subcomponents that provide short, mechanistically faithful descriptions of the network's behavior and that should aggregate appropriately into more global descriptions of the network's learned algorithm. Crucially, each subcomponent gets an auto-generated label describing what it activates on, e.g., "fires on German text and names."
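To make the "read" and "write" picture concrete, here is a minimal NumPy sketch. It is a toy, not Goodfire's implementation: the directions are random, the dimensions and names (`read_dirs`, `write_dirs`) are invented for illustration, and the weight matrix is simply defined as the sum of its subcomponents rather than learned by VPD.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_components = 8, 6, 4  # toy sizes, chosen arbitrarily

# Hypothetical subcomponents: one "read" direction in input space and
# one "write" direction in output space per subcomponent.
read_dirs = rng.normal(size=(n_components, d_in))
write_dirs = rng.normal(size=(n_components, d_out))

# Each subcomponent is the outer product write_i read_i^T, a rank-1 matrix.
subcomponents = [np.outer(write_dirs[i], read_dirs[i]) for i in range(n_components)]

# In this toy the layer's weight matrix is exactly the sum of its subcomponents.
W = sum(subcomponents)

x = rng.normal(size=d_in)

# Applying one subcomponent means: measure how much of the read direction
# is present in x, then add that much of the write direction to the output.
i = 2
via_matrix = subcomponents[i] @ x
via_directions = (read_dirs[i] @ x) * write_dirs[i]
assert np.allclose(via_matrix, via_directions)

# The full layer output is the sum of the per-subcomponent contributions.
full_output = W @ x
summed_output = sum((read_dirs[k] @ x) * write_dirs[k] for k in range(n_components))
assert np.allclose(full_output, summed_output)

ranks = [np.linalg.matrix_rank(c) for c in subcomponents]
```

Because every contribution has the form (scalar activation) × (write direction), a label such as "fires on German text and names" can be read as a description of which inputs make that scalar large.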

Tuning one number to erase a language

The experiment was done as a one-day hackathon using Goodfire's Silico platform. The goal: destroy the model's ability to predict German text while keeping English intact. German was chosen because it was the model's strongest non-English language.
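The article does not spell out which scalar was tuned or how. The sketch below assumes, purely for illustration, that it acts as a multiplier on a single subcomponent's contribution, and sets it by hand instead of optimizing it. The index `GERMAN`, the `build_weight` helper, and the orthonormal read directions are all hypothetical simplifications; real subcomponents need not be orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_components = 8, 6, 4  # toy sizes, chosen arbitrarily

# Orthonormal read directions keep the toy features from overlapping.
q, _ = np.linalg.qr(rng.normal(size=(d_in, n_components)))
read_dirs = q.T                                   # (n_components, d_in)
write_dirs = rng.normal(size=(n_components, d_out))


def build_weight(scales):
    """Return W = sum_i scales[i] * write_i read_i^T."""
    return (write_dirs.T * scales) @ read_dirs


scales = np.ones(n_components)
W_before = build_weight(scales)

# Assumption: subcomponent 1 is the one labeled as firing on German text.
GERMAN = 1
scales_edited = scales.copy()
scales_edited[GERMAN] = 0.0                       # the single edited scalar
W_after = build_weight(scales_edited)

# Stand-ins for inputs that excite the German feature vs. a different one.
german_like = read_dirs[GERMAN]
english_like = read_dirs[0]

german_change = np.linalg.norm(W_after @ german_like - W_before @ german_like)
english_change = np.linalg.norm(W_after @ english_like - W_before @ english_like)

print(f"output change on German-like input:  {german_change:.3f}")
print(f"output change on English-like input: {english_change:.3e}")
```

In this toy the edit changes the weight matrix by exactly one rank-1 term, so inputs that do not excite the edited read direction pass through unchanged. That mirrors the claim above only in spirit: a real model has overlapping features, so the effect on other behavior has to be measured rather than assumed.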
