Causal Intervention

Causal intervention is the deliberate manipulation of a system to test whether a specific component or variable is causally responsible for an observed behavior, as opposed to merely correlated with it. In machine learning and mechanistic interpretability, causal interventions include activation patching, ablation studies, and interchange interventions. Each of these techniques modifies a model's internal representations and measures the resulting change in output, which traces the causal contribution of those representations.
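As a rough illustration of activation patching, the sketch below uses a hand-written three-unit toy network rather than a real model. The names hidden, readout, run, W_OUT, and the two inputs are all hypothetical and chosen only for this example. Each hidden activation from a "clean" run is copied, one at a time, into a "corrupted" run, and the change in output is recorded.

```python
# Toy activation patching. The network, weights, and inputs are all
# hypothetical; this is a sketch of the idea, not a real model or library API.

def hidden(x):
    """Compute three hand-written ReLU hidden units from a 2-d input."""
    h0 = max(0.0, x[0] - x[1])
    h1 = max(0.0, x[0] + x[1])
    h2 = max(0.0, x[1])
    return [h0, h1, h2]

# Output weights: unit 1 has weight 0, so it cannot affect the output.
W_OUT = [2.0, 0.0, 0.5]

def readout(h):
    """Linear readout of the hidden activations."""
    return sum(w * a for w, a in zip(W_OUT, h))

def run(x, patch=None):
    """Forward pass; `patch` maps a hidden index to a value that overwrites it."""
    h = hidden(x)
    for i, value in (patch or {}).items():
        h[i] = value
    return readout(h)

clean = [3.0, 1.0]       # input on which the behavior of interest occurs
corrupted = [1.0, 1.0]   # contrast input on which it does not

clean_h = hidden(clean)
corrupted_out = run(corrupted)

# Patch each clean activation into the corrupted run, one unit at a time,
# and record how far the output moves from the corrupted baseline.
effects = {
    i: run(corrupted, patch={i: clean_h[i]}) - corrupted_out
    for i in range(len(clean_h))
}

for i, effect in effects.items():
    print(f"unit {i}: patching effect = {effect}")
```

In this toy setup, unit 1 takes different values on the clean and corrupted inputs, so its activation is correlated with the change in input, yet patching it leaves the output unchanged. Only unit 0 carries the causal effect, which is the kind of distinction that observing activations alone would not reveal.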

The formal framework for causal intervention derives from Judea Pearl's work on structural causal models and the do-calculus, which formalizes the difference between observing a variable and intervening upon it. The do-operator, do(X=x), represents the act of setting X to a specific value rather than passively observing that X takes that value. This distinction is critical because observational data alone, without additional causal assumptions, cannot in general distinguish causal pathways from confounding associations.
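The gap between observing and intervening can be made concrete with a small worked example. The sketch below assumes a hypothetical three-variable model in which a confounder Z influences both X and Y, while X has no causal effect on Y at all. The probabilities and function names are invented for illustration. Conditioning on X changes what is believed about Z; setting X with do(X=x) does not.

```python
# Hypothetical toy causal model: Z -> X and Z -> Y, with no X -> Y edge.
# All probabilities are made-up values chosen for illustration.

P_Z = {0: 0.5, 1: 0.5}             # prior over the confounder Z
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}    # P(X=1 | Z=z)

# P(Y=1 | X=x, Z=z); it depends only on z, so X is causally irrelevant to Y.
P_Y1_GIVEN_XZ = {
    (x, z): (0.8 if z == 1 else 0.2)
    for x in (0, 1)
    for z in (0, 1)
}

def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1

def p_y1_observe(x):
    """P(Y=1 | X=x): seeing X=x also shifts the distribution over Z."""
    joint = sum(P_Z[z] * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)
    evidence = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return joint / evidence

def p_y1_do(x):
    """P(Y=1 | do(X=x)): X is set by fiat, so Z keeps its prior distribution."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)

observed_gap = p_y1_observe(1) - p_y1_observe(0)
interventional_gap = p_y1_do(1) - p_y1_do(0)

print(f"P(Y=1 | X=1) - P(Y=1 | X=0)         = {observed_gap:.2f}")
print(f"P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)) = {interventional_gap:.2f}")
```

Under these assumed numbers, X and Y are strongly associated in observational data, yet the interventional difference is zero, because the association runs entirely through Z.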

The rise of causal intervention methods in AI interpretability represents a shift from the correlational paradigm that dominated machine learning for decades. But there is a deeper shift: the recognition that understanding a system requires not just predicting its behavior but knowing how it would behave under perturbation. A system that is only observed, never intervened upon, is a system whose causal structure remains hypothetical. The test of understanding is always the test of intervention.