Mutual Information
Mutual information I(X;Y) is a quantity in Information Theory that measures the statistical dependence between two random variables X and Y — specifically, the reduction in uncertainty about X given knowledge of Y (equivalently, about Y given knowledge of X). It is defined as:
- I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)
where H denotes Shannon entropy and H(X|Y) is the conditional entropy. When X and Y are independent, I(X;Y) = 0: knowing Y tells you nothing about X. When X is a deterministic function of Y (so that Y fully determines X), H(X|Y) = 0 and I(X;Y) = H(X): knowing Y eliminates all uncertainty about X. Symmetrically, when Y is a deterministic function of X, I(X;Y) = H(Y).
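The three equivalent definitions above can be checked numerically. The sketch below uses an illustrative 2×2 joint distribution (the particular pmf is an arbitrary choice for demonstration) and confirms that all three expressions give the same value:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative joint distribution p(x, y) over a 2x2 alphabet.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

H_X = entropy(px)
H_Y = entropy(py)
H_XY = entropy(pxy.ravel())

# The three definitions agree, using H(X|Y) = H(X,Y) - H(Y)
# and H(Y|X) = H(X,Y) - H(X):
I1 = H_X + H_Y - H_XY
I2 = H_X - (H_XY - H_Y)
I3 = H_Y - (H_XY - H_X)
print(I1, I2, I3)   # identical, and strictly positive: X and Y are dependent
```

Replacing `pxy` with the outer product of its marginals (`np.outer(px, py)`) makes X and Y independent, and all three expressions drop to zero.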
Mutual information is the central quantity in Claude Shannon's channel coding theorem: the Channel Capacity of a noisy channel is the maximum mutual information between input and output, maximized over all input distributions. This makes mutual information not merely a measure of dependence but the fundamental currency of Digital Communication.
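The maximization in the channel coding theorem can be made concrete for the binary symmetric channel, where the capacity has the known closed form C = 1 − H_b(ε) for crossover probability ε. A minimal sketch (the grid search over input distributions is for illustration; Blahut–Arimoto would be used for general channels):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mi(q, eps):
    """I(X;Y) for a binary symmetric channel with crossover eps
    and input distribution P(X=1) = q."""
    py1 = q * (1 - eps) + (1 - q) * eps   # P(Y=1)
    return h2(py1) - h2(eps)              # I = H(Y) - H(Y|X), H(Y|X) = h2(eps)

eps = 0.1
qs = np.linspace(0.0, 1.0, 101)
capacity = max(bsc_mi(q, eps) for q in qs)
print(capacity, 1 - h2(eps))   # maximum at q = 0.5 recovers C = 1 - h2(eps)
```

The maximizing input distribution is uniform (q = 0.5), which makes the output uniform and H(Y) = 1 bit, so the grid maximum matches the closed form exactly.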
Mutual information has been applied in Neuroscience to quantify how much information neural spike trains carry about stimuli, in Feature Selection in Machine Learning to identify informative variables, and in Causal Inference as a proxy for causal dependence. The last application is the most problematic: mutual information measures statistical dependence, not causation. Two variables can have high mutual information because one causes the other, because both are caused by a third variable, or by coincidence in a finite sample. The failure to respect this distinction has produced a substantial body of neuroscience literature claiming to have discovered information coding where all that has been demonstrated is correlation.
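The finite-sample pitfall mentioned above is easy to exhibit: the plug-in (maximum-likelihood) estimator of mutual information is biased upward, so two variables that are independent by construction still yield a strictly positive estimate. A minimal sketch (alphabet size, sample size, and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_mi(x, y, k):
    """Plug-in MI estimate in bits from paired samples over alphabets {0..k-1}."""
    joint = np.zeros((k, k))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= len(x)
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    mi = 0.0
    for i in range(k):
        for j in range(k):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (px[i] * py[j]))
    return mi

# Two INDEPENDENT uniform variables: the true MI is exactly 0,
# yet every estimate below comes out positive.
k, n = 8, 200
estimates = [plugin_mi(rng.integers(k, size=n), rng.integers(k, size=n), k)
             for _ in range(100)]
print(np.mean(estimates))   # strictly positive: upward bias, roughly (k-1)^2/(2n ln 2)
```

A naive significance claim based on such an estimate would "discover" information where there is none; bias corrections (e.g. Miller–Madow) or shuffle controls are needed before interpreting small MI values.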