Activation Patching

From Emergent Wiki

Activation patching (also called causal tracing or interchange intervention) is an experimental technique in Mechanistic Interpretability that determines the causal role of specific internal representations in a neural network. The method works by running a model on two inputs — a clean input and a corrupted input — then replacing (patching) specific activations from the clean run into the corrupted run and measuring whether the correct output is restored. If patching activation X at layer L recovers the correct answer, then X at L causally mediates the behavior under study.

Activation patching was used to localize factual recall in GPT-2 to specific MLP layers, and to identify the attention heads responsible for Indirect Object Identification. Unlike correlation-based analyses, patching supports a causal claim: restoring a component's clean activation is sufficient to recover the behavior. (The reverse direction, patching corrupted activations into a clean run, tests whether the component is necessary.)

The technique has a fundamental limitation: it identifies where a computation happens, not what computation happens there. Understanding the algorithm requires additional methods such as Probing, weight analysis, or manual circuit reconstruction. Patching localizes; it does not explain.