Information geometry
Information geometry is the study of statistical manifolds, spaces of probability distributions equipped with a differential-geometric structure. The field was founded by C. R. Rao in 1945 and developed systematically by Shun'ichi Amari in the 1980s. Its central insight is that the space of probability distributions is not merely a set but a geometric object: a manifold with a Riemannian metric (the Fisher information metric) and a dual pair of affine connections that encode how distributions change under parameter variation.
The Fisher information metric measures the discriminability of nearby distributions: the geodesic distance between two parameter values quantifies the amount of data required to distinguish them statistically. This is not a metaphor. It is a theorem. The Cramér-Rao bound, the fundamental limit on the variance of unbiased estimators, is a statement about the Fisher metric: the covariance of any unbiased estimator is bounded below by the inverse of the Fisher information matrix. The more slowly a distribution moves through this geometry as its parameter varies, the worse any estimator of that parameter must be.
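To make this concrete, here is a minimal sketch (plain Python with NumPy; the Bernoulli model, sample size, and seed are illustrative choices, not from the text) that computes the Fisher information of a Bernoulli parameter and checks by simulation that the maximum-likelihood estimator attains the Cramér-Rao bound.

```python
import numpy as np

# Minimal sketch: Fisher information of a Bernoulli(p) model and the
# Cramer-Rao bound. For Bernoulli, I(p) = 1 / (p (1 - p)); the sample
# mean is an unbiased estimator whose variance attains the bound.

rng = np.random.default_rng(0)
p, n, trials = 0.3, 200, 20_000              # illustrative values

fisher_info = 1.0 / (p * (1.0 - p))          # per-observation Fisher information
crb = 1.0 / (n * fisher_info)                # Cramer-Rao bound for n observations

estimates = rng.binomial(n, p, size=trials) / n   # MLE (sample mean), per trial
print(f"Cramer-Rao bound   : {crb:.6f}")
print(f"empirical variance : {estimates.var():.6f}")  # matches: bound is attained
```

The two numbers agree because the Bernoulli sample mean is an efficient estimator; for models where no efficient estimator exists, the empirical variance would sit strictly above the bound.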
The Dual Structure of Statistical Manifolds
Information geometry reveals that statistical manifolds carry not one but two natural affine connections: the e-connection (exponential) and the m-connection (mixture). These connections are dual with respect to the Fisher metric: transporting one vector along the e-connection and another along the m-connection leaves their Fisher inner product unchanged. This duality is not a mathematical curiosity. It encodes the fundamental duality between maximum likelihood estimation (e-flat models) and moment matching (m-flat models).
Exponential families, the distributions of maximum entropy subject to moment constraints, are e-flat: their geodesics under the e-connection are straight lines in the natural parameter space. Mixture families are m-flat: their geodesics under the m-connection are straight lines in the space of probability densities. Dual flatness means that minimizing a divergence over one family reduces to a projection along a geodesic of the dual connection, a fact exploited by algorithms in machine learning, statistical physics, and neural network training.
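To see what the dual coordinates look like in the simplest case, here is a minimal sketch (plain Python with NumPy; the Bernoulli family and the two logit values are illustrative choices). The natural parameter theta is an e-flat coordinate, the mean parameter mu = psi'(theta) is the dual m-flat coordinate, and the KL divergence between two members of the family equals the Bregman divergence of the log-partition function psi in natural coordinates, the identity that underlies the projection picture.

```python
import numpy as np

# Sketch of dual coordinates for the Bernoulli exponential family.
# Natural parameter theta (logit) gives e-flat coordinates; the mean
# parameter mu = psi'(theta) gives m-flat coordinates. For exponential
# families, KL(p_1 || p_2) equals the Bregman divergence of the
# log-partition function psi, evaluated in natural coordinates.

def psi(theta):                      # log-partition: log(1 + e^theta)
    return np.logaddexp(0.0, theta)

def mean(theta):                     # mu = psi'(theta) = sigmoid(theta)
    return 1.0 / (1.0 + np.exp(-theta))

def kl_direct(p, q):                 # KL between Bernoulli(p) and Bernoulli(q)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_bregman(theta1, theta2):      # Bregman divergence of psi
    return psi(theta2) - psi(theta1) - (theta2 - theta1) * mean(theta1)

t1, t2 = 0.5, -1.2
print(kl_direct(mean(t1), mean(t2)))   # direct computation
print(kl_bregman(t1, t2))              # identical, up to float rounding
```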
The systems-level insight: information geometry treats statistical inference not as a sequence of point estimates but as movement on a manifold. A learning algorithm that updates parameters from data is a trajectory on the statistical manifold. The convergence properties of the algorithm — whether it gets stuck in local minima, how fast it approaches the optimal distribution — are geometric properties of the manifold's curvature and the path's alignment with the natural gradient.
Applications and Extensions
Information geometry has found applications across domains that share a common structure: spaces of distributions with parameter-dependent geometry.
In statistical physics, the geometry of Gibbs ensembles encodes phase transitions. Near a critical point, components of the Fisher information metric diverge: in temperature and field coordinates its entries are the energy and magnetization fluctuations, proportional to the specific heat and the susceptibility, and their divergence reflects the divergence of the correlation length and the fluctuations that overwhelm mean-field approximations. The curvature of the statistical manifold signals the approach to a phase transition before any order parameter has changed, a geometric early-warning system.
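This signature is visible in a model small enough to treat exactly. For the Gibbs family p_beta(x) proportional to exp(-beta E(x)), differentiating log p shows that the Fisher information with respect to inverse temperature beta is Var(E). The sketch below (plain Python; the Curie-Weiss mean-field Ising model, the system size, and the beta grid are illustrative choices, not from the text) computes this exactly by summing over magnetization sectors: the per-spin Fisher information rises sharply near the mean-field critical point beta = 1 (with J = 1), and the feature sharpens as N grows.

```python
import numpy as np
from math import lgamma

# Sketch: Fisher information of a Gibbs ensemble with respect to inverse
# temperature. For p_beta(x) ~ exp(-beta E(x)), I(beta) = Var(E).
# Curie-Weiss model: E depends only on the magnetization, so the 2^N
# states collapse into N + 1 sectors with binomial multiplicities.

def fisher_info(beta, N, J=1.0):
    k = np.arange(N + 1)                       # number of up spins
    m = (2 * k - N) / N                        # magnetization per spin
    E = -0.5 * J * N * m**2                    # Curie-Weiss energy
    logw = np.array([lgamma(N + 1) - lgamma(i + 1) - lgamma(N - i + 1)
                     for i in k])              # log multiplicity log C(N, k)
    logp = logw - beta * E
    p = np.exp(logp - logp.max())              # unnormalized, overflow-safe
    p /= p.sum()
    mean_E = p @ E
    return p @ (E - mean_E) ** 2               # I(beta) = Var(E)

N = 2000
for beta in (0.5, 0.8, 0.95, 1.0, 1.05, 1.2, 1.5):
    print(f"beta = {beta:.2f}   I(beta)/N = {fisher_info(beta, N) / N:.4f}")
```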
In machine learning, natural gradient descent — which follows the steepest descent direction with respect to the Fisher metric rather than the Euclidean metric — accounts for the local geometry of the parameter space and often converges faster than standard gradient descent. The TRPO and PPO algorithms in reinforcement learning exploit this structure to ensure stable policy updates.
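A minimal sketch of why the preconditioning matters, reusing the Bernoulli family in its logit coordinate (plain Python with NumPy; the target, starting point, step size, and iteration count are illustrative choices): for an exponential family the gradient of KL(p_theta || p_target) is F(theta) (theta - theta_target), so dividing by the Fisher information F aims the update straight at the target, while the vanilla gradient nearly vanishes wherever the sigmoid saturates.

```python
import numpy as np

# Sketch: natural vs. vanilla gradient descent on the Bernoulli family in
# its natural (logit) parameter theta, minimizing KL(p_theta || p_target).
# The gradient is F(theta) (theta - theta_star) with F the Fisher
# information, so the natural gradient F^{-1} grad points straight at the
# optimum; vanilla gradient descent crawls when the sigmoid saturates.

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
theta_star = 2.0                                  # target logit

def kl(theta):                                    # KL(p_theta || p_theta*)
    p, q = sigmoid(theta), sigmoid(theta_star)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def grad(theta):                                  # = F(theta) (theta - theta*)
    return sigmoid(theta) * (1 - sigmoid(theta)) * (theta - theta_star)

def run(natural, lr=0.5, steps=100):
    theta = -8.0                                  # saturated start: tiny gradients
    for _ in range(steps):
        g = grad(theta)
        if natural:                               # precondition by 1 / F(theta)
            g /= sigmoid(theta) * (1 - sigmoid(theta))
        theta -= lr * g
    return kl(theta)

print("vanilla :", run(natural=False))           # barely moved from the start
print("natural :", run(natural=True))            # essentially converged
```

The natural-gradient update here is parameterization-invariant: rewriting the model in mean coordinates would leave its trajectory through distribution space unchanged, which is exactly the property trust-region methods like TRPO approximate.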
In evolutionary dynamics, information geometry provides a framework for the Price equation and the geometry of fitness landscapes. The evolution of a population's distribution over genotypes can be understood as gradient flow on a statistical manifold, with the Fisher metric encoding the population's genetic variability and the natural selection gradient pointing toward higher fitness.
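A minimal sketch of that gradient-flow reading (plain Python with NumPy; the fitness values, step size, and horizon are illustrative choices): with constant fitnesses, the replicator equation dx_i/dt = x_i (f_i - mean fitness) is the gradient flow of mean fitness under the Shahshahani metric, the Fisher metric on the simplex, and along the flow mean fitness rises at a rate equal to the fitness variance, which is Fisher's fundamental theorem of natural selection.

```python
import numpy as np

# Sketch: replicator dynamics as natural-gradient flow of mean fitness on
# the probability simplex. With constant fitnesses f_i, mean fitness
# increases at a rate equal to the fitness variance along the trajectory.

f = np.array([1.0, 1.5, 2.0, 2.5])        # constant genotype fitnesses
x = np.full(4, 0.25)                       # initial genotype frequencies
dt = 0.01

for step in range(1000):
    mean_f = x @ f
    var_f = x @ (f - mean_f) ** 2
    if step % 250 == 0:
        # d(mean_f)/dt equals var_f along the trajectory
        print(f"mean fitness {mean_f:.4f}   variance {var_f:.4f}")
    x = x + dt * x * (f - mean_f)          # explicit Euler replicator step
    x /= x.sum()                           # guard against numerical drift

print("final frequencies:", np.round(x, 4))  # mass concentrates on the fittest
```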
The field extends beyond parametric statistics to quantum information theory, where the quantum Fisher information metric sets the ultimate precision limit for parameter estimation in quantum systems, and to computational neuroscience, where the geometry of neural population codes determines the efficiency with which stimuli are represented.
Information geometry is not merely a mathematical framework for statistics. It is a demonstration that inference — the fundamental operation of learning, science, and cognition — has a geometry, and that understanding the geometry is prerequisite to understanding the limits of what can be learned. The claim that probability distributions live on a manifold is not an abstraction. It is a fact about the structure of inductive reasoning, and any system that learns without respecting that structure — whether biological or artificial — pays a cost in efficiency, accuracy, or both.