Activation Patching

Activation patching (also called causal tracing or interchange intervention) is an experimental technique in Mechanistic Interpretability for testing the causal role of specific internal representations in a neural network. The method runs the model on a clean input and on a corrupted input, then replaces (patches) chosen activations from the clean run into the corrupted run and measures whether the correct output is restored. If patching activation X at layer L recovers the correct answer, then X at L causally mediates the behavior under study.
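A minimal sketch of this loop using PyTorch forward hooks follows. The names `model`, `layer_module`, and `patch_activation` are hypothetical stand-ins, not the API of any particular library; the sketch assumes the clean and corrupted inputs are token-aligned and that the hooked module returns a bare tensor.

```python
import torch

def patch_activation(model, layer_module, clean_ids, corrupt_ids):
    """Clean-to-corrupted patch at one module (hypothetical helper).

    Runs the model on clean_ids, caches the output of layer_module,
    then reruns on corrupt_ids with that cached activation swapped in.
    Note: Hugging Face transformer blocks return tuples rather than
    bare tensors, so the hooks would need a small adjustment there.
    """
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's
        # output; here the corrupted activation is overwritten wholesale.
        # A real experiment would usually patch one position at a time.
        return cache["clean"]

    handle = layer_module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_ids)                     # clean run: record activation
    handle.remove()

    handle = layer_module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_ids)  # corrupted run, patched
    handle.remove()

    return patched_logits
```

The patched output is then scored with a metric such as the logit difference between the correct and incorrect answer, compared against the unpatched clean and corrupted baselines; sweeping `layer_module` over layers (and positions) produces the familiar patching heatmap.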

Activation patching was used to localize factual recall in GPT-2 to specific mid-layer MLPs, and to identify the attention heads that carry out Indirect Object Identification. Unlike correlation-based analyses, patching yields causal evidence, though of two distinct kinds: patching clean activations into a corrupted run (denoising) shows a component is sufficient to restore the behavior, while patching corrupted activations into a clean run (noising) shows it is necessary.

The technique has a fundamental limitation: it identifies where a computation happens, not what computation happens there. Understanding the algorithm requires additional methods such as Probing, weight analysis, or manual circuit reconstruction. Patching localizes; it does not explain.