Kullback-Leibler Divergence

The Kullback-Leibler divergence (KL divergence), also called relative entropy, is an asymmetric measure of how a probability distribution P differs from a reference distribution Q. It quantifies the information lost when Q is used to approximate P, or equivalently, the expected number of extra bits (when the logarithm is taken to base 2) required to encode samples from P using a code optimized for Q. For discrete distributions P and Q over the same set of outcomes, it is defined as:

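D_KL(P || Q) = Σ_x P(x) log(P(x) / Q(x))

The sum runs over all outcomes x. Terms with P(x) = 0 contribute zero, and the divergence is infinite if Q(x) = 0 for some x with P(x) > 0. The base of the logarithm sets the unit: base 2 gives bits, the natural logarithm gives nats.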
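The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name kl_divergence, its base parameter, and the example distributions are assumptions made here, and the handling of zero probabilities follows the usual conventions (0 log 0 = 0, and an infinite result when Q assigns zero probability to an outcome that P can produce).

```python
import math


def kl_divergence(p, q, base=2.0):
    """Return D_KL(P || Q) for discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: a term with P(x) = 0 contributes nothing
        if qx == 0.0:
            return math.inf  # P puts mass on an outcome that Q rules out
        total += px * math.log(px / qx, base)
    return total


p = [0.5, 0.5]    # illustrative distribution P
q = [0.25, 0.75]  # illustrative approximation Q
kl_bits = kl_divergence(p, q)
print(f"D_KL(P || Q) = {kl_bits:.4f} bits")
```

With base=2.0 the result is in bits; passing base=math.e gives nats instead.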

The KL divergence is always non-negative (Gibbs' inequality), and it equals zero if and only if P and Q are identical almost everywhere. It is not a true metric: it is not symmetric, since D_KL(P || Q) ≠ D_KL(Q || P) in general, and it does not satisfy the triangle inequality. Despite this, it is a standard measure of divergence between distributions in information theory, statistics, and machine learning.
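These properties are easy to check numerically. The sketch below uses three small distributions chosen here purely for illustration (p, q, and r are not from any particular source) and a helper kl that works in nats and assumes Q is positive wherever P is.

```python
import math


def kl(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.5]
q = [0.25, 0.75]
r = [0.1, 0.9]

# Zero when both arguments are the same distribution.
print("D(P||P) =", kl(p, p))

# Asymmetry: swapping the arguments changes the value.
print("D(P||Q) =", round(kl(p, q), 4), " D(Q||P) =", round(kl(q, p), 4))

# Triangle inequality fails: the direct divergence exceeds the two-step sum.
print("D(P||R) =", round(kl(p, r), 4),
      " D(P||Q) + D(Q||R) =", round(kl(p, q) + kl(q, r), 4))
```

For these particular inputs, the divergence from P to R is larger than the sum of the divergences via Q, which is exactly the violation a metric would forbid.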

In machine learning, the KL divergence appears in variational inference, where an approximate posterior is fitted by minimizing its KL divergence from the true posterior (equivalently, by maximizing the evidence lower bound). It also appears in the PAC-Bayes framework, where generalization bounds contain a KL term that penalizes posterior distributions far from the prior. In information geometry, the second-order expansion of the KL divergence between nearby members of a parametric family yields the Fisher information metric, a Riemannian metric on the manifold of probability distributions, connecting statistical inference to differential geometry.
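In both variational inference and PAC-Bayes, the distributions involved are often chosen to be Gaussian, in which case the KL term has a closed form. As a hedged illustration (the univariate setting, the function name kl_gaussian, and the standard-normal prior are assumptions made for this example), the divergence between two one-dimensional Gaussians can be computed as follows.

```python
import math


def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats, using the closed form."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


prior = (0.0, 1.0)  # hypothetical standard-normal prior (mean, standard deviation)

# The penalty grows as the candidate posterior's mean moves away from the prior.
for mu in (0.0, 0.5, 1.0, 2.0):
    print(f"posterior mean {mu}: KL to prior = {kl_gaussian(mu, 1.0, *prior):.3f} nats")
```

With equal variances the penalty reduces to half the squared distance between the means, which is why this term acts as a quadratic regularizer in that special case.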

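The KL divergence is a natural objective in many optimization problems because it equals the expected log-likelihood ratio, log(P(x)/Q(x)), taken under P. Minimizing the divergence from an empirical data distribution to a model is therefore equivalent to maximum-likelihood estimation, and the log-likelihood ratio itself is the basic statistic in hypothesis testing and model comparison.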
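The reading of the KL divergence as an expected log-likelihood ratio can be illustrated by simulation: draw samples from P, average log(P(x)/Q(x)) over them, and compare with the exact sum. The distributions, sample size, seed, and function names below are assumptions chosen for this sketch.

```python
import math
import random


def exact_kl(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def estimate_kl(p, q, n_samples, seed=0):
    """Monte Carlo estimate: average log-likelihood ratio over samples drawn from P."""
    rng = random.Random(seed)
    outcomes = list(range(len(p)))
    draws = rng.choices(outcomes, weights=p, k=n_samples)
    return sum(math.log(p[x] / q[x]) for x in draws) / n_samples


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

exact = exact_kl(p, q)
estimate = estimate_kl(p, q, n_samples=100_000)
print(f"exact: {exact:.4f} nats, sampled estimate: {estimate:.4f} nats")
```

The sampled average approaches the exact value as the number of samples grows, which is the sense in which the divergence is the mean of the log-likelihood ratio under P.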