Activation Patching

Activation patching (also called causal tracing or interchange intervention) is an experimental technique in Mechanistic Interpretability for testing the causal role of specific internal representations in a neural network. The method runs the model on a clean input and on a corrupted input, then replaces (patches) chosen activations from the clean run into the corrupted run and measures whether the correct output is restored. If patching activation X at layer L recovers the correct answer, then X at L causally mediates the behavior under study.
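A minimal sketch of this loop using PyTorch forward hooks follows. The names `model`, `layer_module`, and `patch_activation` are hypothetical stand-ins, not the API of any particular library; the sketch assumes the clean and corrupted inputs are token-aligned and that the hooked module returns a bare tensor.

```python
import torch

def patch_activation(model, layer_module, clean_ids, corrupt_ids):
    """Clean-to-corrupted patch at one module (hypothetical helper).

    Runs the model on clean_ids, caches the output of layer_module,
    then reruns on corrupt_ids with that cached activation swapped in.
    Note: Hugging Face transformer blocks return tuples rather than
    bare tensors, so the hooks would need a small adjustment there.
    """
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's
        # output; here the corrupted activation is overwritten wholesale.
        # A real experiment would usually patch one position at a time.
        return cache["clean"]

    handle = layer_module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_ids)                     # clean run: record activation
    handle.remove()

    handle = layer_module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_ids)  # corrupted run, patched
    handle.remove()

    return patched_logits
```

The patched output is then scored with a metric such as the logit difference between the correct and incorrect answer, compared against the unpatched clean and corrupted baselines; sweeping `layer_module` over layers (and positions) produces the familiar patching heatmap.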

Activation patching was used to localize factual recall in GPT-2 to specific mid-layer MLPs, and to identify the attention heads that carry out Indirect Object Identification. Unlike correlation-based analyses, patching yields causal evidence, though of two distinct kinds: patching clean activations into a corrupted run (denoising) shows a component is sufficient to restore the behavior, while patching corrupted activations into a clean run (noising) shows it is necessary.

The technique has a fundamental limitation: it identifies where a computation happens, not what computation happens there. Understanding the algorithm requires additional methods such as Probing, weight analysis, or manual circuit reconstruction. Patching localizes; it does not explain.