Kullback-Leibler divergence

The Kullback-Leibler divergence (KL divergence, also relative entropy) D_KL(P || Q) measures how much information is lost when probability distribution Q is used to approximate distribution P. Defined as D_KL(P || Q) = sum over x of P(x) log(P(x)/Q(x)), it is always non-negative and equals zero if and only if P and Q are identical — consequences of Jensen's inequality applied to the convex logarithm function. Unlike a true metric, KL divergence is not symmetric: D_KL(P || Q) is not in general equal to D_KL(Q || P). This asymmetry is not a technical defect. It reflects a real asymmetry in the problem: using Q to approximate P (forward KL, which penalizes underestimating P's mass) has different consequences from using P to approximate Q (reverse KL, which penalizes overestimating P's mass in regions where Q is small). In variational inference, the choice of KL direction determines whether the approximation is mean-seeking or mode-seeking — a consequential modeling decision that is often made by default rather than design.

KL divergence appears throughout information theory, Bayesian statistics, and machine learning. In information theory, D_KL(P || Q) is the expected number of extra bits required to encode samples from P using a code optimized for Q — the information cost of model misspecification. In Bayesian model comparison, it measures how much information the data provides about hypotheses. In modern machine learning, it is the core of variational autoencoders, normalizing flows, and the ELBO objective in variational inference — contexts where it functions as a regularization pressure pushing approximate posteriors toward priors.

The practical interpretive challenge: KL divergence is unbounded above. If Q assigns zero probability to an event that P assigns positive probability, D_KL(P || Q) is infinite. This is not a quirk — it is the formal expression of a real epistemic disaster: your model has ruled out something that actually happened. Any Bayesian framework that uses Q as a prior must assign positive probability to all events P is capable of generating, or the framework collapses at the first disconfirming observation. This constraint is routinely violated in practice by mixture models and truncated distributions, producing infinite KL divergence that practitioners paper over with numerical tricks. The tricks work until they do not.