Differential entropy
Differential entropy is the extension of Shannon entropy from discrete probability distributions to continuous ones. For a continuous random variable X with probability density function f(x), the differential entropy is defined as h(X) = −∫ f(x) log f(x) dx, where the integral ranges over the support of the distribution and the base of the logarithm sets the unit (bits for base 2, nats for base e). Unlike its discrete counterpart, differential entropy is not an absolute measure of uncertainty. It can be negative, it is not invariant under coordinate transformations (rescaling X by a factor a shifts it by log|a|), and it does not bound the expected code length in the way that Shannon entropy does. These properties make differential entropy a stranger and more subtle quantity than its discrete ancestor, and a more useful one once its limitations are understood.
From Discrete to Continuous
The discrete entropy H(X) = −Σ p(x) log p(x) measures the expected information content of a random variable in bits (or nats, depending on the base of the logarithm). When one passes to the continuous limit by refining a discrete histogram into an increasingly fine partition, the discrete entropy diverges: for bins of width Δ it behaves like H(X_Δ) ≈ h(X) − log Δ, which grows without bound as Δ shrinks. The differential entropy emerges as the finite remnant of this divergence: the part of the entropy that depends on the shape of the density, rather than the resolution of the observation.
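The divergence and its finite remnant can be checked numerically. The following is a minimal sketch, assuming a zero-mean Gaussian and natural logarithms; the function names and the truncation of the histogram at ten standard deviations are illustrative choices, not part of any standard definition.

```python
import math

def normal_cdf(x, sigma=1.0):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def binned_entropy(delta, sigma=1.0, span=10.0):
    """Shannon entropy (nats) of a zero-mean Gaussian quantized into bins of width delta.

    Bins cover [-span*sigma, span*sigma]; the mass outside is negligible and ignored.
    """
    n_bins = int(round(2.0 * span * sigma / delta))
    lo = -span * sigma
    entropy = 0.0
    for i in range(n_bins):
        # Exact probability of bin i, from the CDF at its two edges.
        p = normal_cdf(lo + (i + 1) * delta, sigma) - normal_cdf(lo + i * delta, sigma)
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy

def gaussian_differential_entropy(sigma=1.0):
    """Closed-form differential entropy (nats) of a Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

for delta in (1.0, 0.1, 0.01):
    H = binned_entropy(delta)
    print(f"delta={delta:<5} H={H:.4f}  H+log(delta)={H + math.log(delta):.4f}")
print(f"h(X) = {gaussian_differential_entropy():.4f}")
```

As the bin width shrinks, the discrete entropy H keeps growing, while H + log(delta) settles toward the closed-form differential entropy.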
This means that differential entropy is not the limit of discrete entropy itself. It is a renormalized quantity: what remains after the divergent term −log Δ, where Δ is the bin width, is subtracted, which amounts to defining entropy relative to a uniform reference measure (Lebesgue measure). The differential entropy of a Gaussian distribution with variance σ², for instance, is ½ log(2πeσ²), which is negative for σ² < 1/(2πe). The negativity is not a paradox; it merely reflects that the distribution is more concentrated than a uniform density on an interval of unit length, whose differential entropy is zero. Differential entropy measures relative concentration, not absolute information.
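The sign change of the Gaussian formula, and its sensitivity to the choice of units, can be illustrated in a few lines. This is a sketch using natural logarithms; the function name and the scale factor a = 3 are arbitrary choices made for illustration.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Standard deviation at which the entropy crosses zero: sigma^2 = 1 / (2*pi*e).
sigma_zero = math.sqrt(1.0 / (2.0 * math.pi * math.e))

for sigma in (1.0, sigma_zero, 0.1):
    print(f"sigma={sigma:.4f}  h={gaussian_entropy(sigma):+.4f} nats")

# Rescaling X -> a*X multiplies sigma by |a| and shifts the entropy by log|a|.
a = 3.0
shift = gaussian_entropy(a * 1.0) - gaussian_entropy(1.0)
print(f"shift under a={a}: {shift:.4f}  log(a)={math.log(a):.4f}")
```

The same physical quantity measured in different units therefore has a different differential entropy, which is one concrete sense in which the value is relative rather than absolute.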
Estimation and the Curse of Dimensionality
Estimating differential entropy from finite samples is a long-standing problem in non-parametric statistics. The Kozachenko-Leonenko estimator addresses it by exploiting local structure through nearest-neighbor distances: the distance from each sample to its k-th nearest neighbor serves as a local proxy for the density there. This avoids explicit binning or kernel density estimation, both of which degrade quickly as the dimension grows. The Kraskov-Stögbauer-Grassberger (KSG) estimator extends this approach to mutual information, creating a family of methods that treat entropy as a local geometric property rather than a global statistical one.
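A minimal one-dimensional sketch of the nearest-neighbor idea follows. It uses the common form of the Kozachenko-Leonenko estimate, ψ(N) − ψ(k) + log c_d + (d/N) Σ log r_i, where r_i is the distance from sample i to its k-th nearest neighbor and c_d is the volume of the unit ball (c_1 = 2 in one dimension). The function names, the default k = 3, the sample size, and the random seed are illustrative assumptions, and the code assumes all samples are distinct.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))

def kl_entropy_1d(samples, k=3):
    """Kozachenko-Leonenko estimate (nats) of differential entropy for 1-D samples.

    Assumes the samples are distinct, so every neighbor distance is positive.
    """
    xs = sorted(samples)
    n = len(xs)
    if n <= k:
        raise ValueError("need more than k samples")
    log_dist_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted order, the k nearest neighbors of xs[i] lie within k positions of i.
        window = range(max(0, i - k), min(n, i + k + 1))
        dists = sorted(abs(xs[j] - x) for j in window if j != i)
        log_dist_sum += math.log(dists[k - 1])
    # log(2) is the log-volume of the 1-D unit ball, i.e. the interval [-1, 1].
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_dist_sum / n

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]
estimate = kl_entropy_1d(data)
exact = 0.5 * math.log(2.0 * math.pi * math.e)
print(f"estimate={estimate:.3f}  exact={exact:.3f}")
```

No histogram or bandwidth appears anywhere: the only tuning parameter is k, which trades variance (small k) against bias (large k). In higher dimensions the sorted-window trick no longer applies, and a spatial index such as a k-d tree would typically be used to find neighbors.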
The difficulty of estimation reveals a deeper truth: in practice, differential entropy depends not only on the values of the density but on the geometry of how probability mass is arranged in space. In high-dimensional spaces, the concept of a 'well-defined' density becomes problematic. The curse of dimensionality means that the volume of space grows exponentially with dimension while the data remain sparse, so nearest-neighbor distances that are informative in low dimensions become large and nearly uniform, and estimates built on them become unstable. This is not always a failure of the estimator; it can be a signal that the assumption of a smooth underlying density is wrong, for instance when the data concentrate near a lower-dimensional surface, where no density with respect to the ambient volume exists.
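Two simple calculations give a feel for how quickly volume outruns data. The sketch below is illustrative only: the helper names and the grid spacing of 0.1 are assumptions chosen for the example.

```python
import math

def unit_ball_volume(d):
    """Volume of the d-dimensional Euclidean ball of radius 1."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)

def ball_fraction_of_cube(d):
    """Fraction of the cube [-1, 1]^d occupied by the inscribed unit ball."""
    return unit_ball_volume(d) / 2.0 ** d

def samples_for_spacing(d, spacing=0.1):
    """Number of grid points needed to cover [0, 1]^d at the given spacing."""
    return round(1.0 / spacing) ** d

for d in (1, 2, 5, 10, 20):
    print(f"d={d:<3} ball/cube={ball_fraction_of_cube(d):.3e}  "
          f"grid points={samples_for_spacing(d):.3e}")
```

The neighborhood around any point occupies a vanishing share of the space as the dimension grows, and the number of samples needed to keep neighbors at a fixed distance grows exponentially, which is why nearest-neighbor distances stop being a reliable local probe.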
Connections to Statistical Mechanics and Inference
Differential entropy plays a central role in the maximum entropy principle, where one seeks the probability distribution that maximizes entropy subject to known constraints. For continuous variables on the real line, the maximum entropy distribution with a given variance is the Gaussian, a result that connects differential entropy to the central limit theorem and to the physics of thermal equilibrium. The Jaynesian framework treats differential entropy as a measure of the 'spread' of a distribution, and the maximum entropy solution as the least biased estimate compatible with the known constraints.
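The Gaussian's maximality can be spot-checked against two other familiar families matched to the same variance. This sketch uses the standard closed-form entropies in nats; the choice of Laplace and uniform as comparison distributions, and the function names, are illustrative assumptions.

```python
import math

def entropy_gaussian(sigma):
    """Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def entropy_laplace(sigma):
    """Laplace with scale b has variance 2*b^2 and entropy 1 + log(2*b)."""
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

def entropy_uniform(sigma):
    """Uniform on an interval of width w has variance w^2/12 and entropy log(w)."""
    w = sigma * math.sqrt(12.0)
    return math.log(w)

for sigma in (0.1, 1.0, 5.0):
    print(f"sigma={sigma:<4} gaussian={entropy_gaussian(sigma):+.4f}  "
          f"laplace={entropy_laplace(sigma):+.4f}  uniform={entropy_uniform(sigma):+.4f}")
```

At every variance the Gaussian comes out on top, and the gaps between the three are constant: changing σ shifts all of them by the same log σ, so the ranking reflects shape alone.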
In statistical mechanics, differential entropy appears in the Gibbs entropy formula, where the continuous phase space of classical mechanics demands a continuous entropy measure. The connection between the Gibbs entropy and the Liouville theorem, which states that phase space volume is conserved under Hamiltonian dynamics, shows that differential entropy is not merely a statistical curiosity: because the flow preserves volume, the differential entropy of the phase-space density is unchanged by the dynamics. It is the bridge between probability theory and the physics of many-body systems.
Differential entropy is often taught as a footnote to Shannon entropy: 'the continuous version, but be careful, it can be negative.' This is a disservice. The negativity is not a bug but a feature. It reveals that entropy is not a measure of information but a measure of relative concentration — and that the most interesting distributions in nature are precisely those that are more concentrated than the uniform background. Differential entropy is not a failed discrete entropy. It is a successful geometric entropy, one that tells us not how much we know, but how much more concentrated the world is than we expected.