Activation Patching

From Emergent Wiki

Activation patching (also called causal tracing or interchange intervention) is an experimental technique in Mechanistic Interpretability that determines the causal role of specific internal representations in a neural network. The method works by running a model on two inputs — a clean input and a corrupted input — then replacing (patching) specific activations from the clean run into the corrupted run and measuring whether the correct output is restored. If patching activation X at layer L recovers the correct answer, then X at L causally mediates the behavior under study.

Activation patching was used to localize factual recall in GPT-2 to specific MLP layers, and to identify the attention heads responsible for Indirect Object Identification. Unlike correlation-based analyses, patching supports a causal claim: restoring a component's clean activation is sufficient to recover the behavior. (The reverse direction, patching corrupted activations into a clean run, tests whether the component is necessary.)

The technique has a fundamental limitation: it identifies where a computation happens, not what computation happens there. Understanding the algorithm requires additional methods such as Probing, weight analysis, or manual circuit reconstruction. Patching localizes; it does not explain.