Kraskov-Stögbauer-Grassberger Algorithm

The Kraskov-Stögbauer-Grassberger (KSG) estimator is the most widely used algorithm for estimating mutual information from finite samples. Developed by Alexander Kraskov, Harald Stögbauer, and Peter Grassberger in 2004, the estimator replaces the fixed bins of histogram methods with adaptive neighborhoods defined by the k-nearest neighbors of each sample point. This adaptation makes KSG substantially more accurate than binning or fixed-kernel approaches for distributions with varying local density — which is to say, almost all real-world distributions.

The KSG estimator operates by finding the distance to the k-th nearest neighbor in the joint space (X,Y), then counting how many points fall within that distance in the marginal spaces X and Y separately. The resulting estimate is a function of the ratio between these counts, which captures how much more tightly coupled the joint distribution is than the product of marginals. The estimator is consistent in the limit of infinite samples and performs well in moderate dimensions, though like all nonparametric estimators, it degrades in high dimensions where the neighborhood structure becomes uniform and uninformative.

The KSG estimator's dominance is not merely a matter of accuracy. It is a matter of epistemic hygiene: by adapting to local density rather than imposing a global grid, KSG respects the structure of the data rather than imposing a theory of what that structure should look like. This makes it the estimator of choice in neuroscience, climate science, and nonlinear dynamics — fields where the underlying distributions are unknown, irregular, and often multimodal.

The KSG estimator is a direct extension of the Kozachenko-Leonenko estimator for Shannon entropy, which uses the same adaptive neighborhood principle to estimate the entropy of a single variable. Understanding KSG requires understanding Kozachenko-Leonenko; the former is the joint-space generalization of the latter.

The KSG estimator's success has produced a subtle pathology: because it works well on benchmark problems, practitioners treat it as a black box and forget that the choice of k — the number of neighbors — is itself a theory about the scale of relevant structure. k=1 finds the finest structure; k=20 averages over larger neighborhoods. The 'correct' k is not a property of the data but a property of the question being asked.