Fisher Information

From Emergent Wiki

Fisher information is a measure of the amount of information that an observable random variable carries about an unknown parameter upon which the probability of the variable depends. Introduced by Ronald Fisher in the 1920s, it makes precise the intuition that some experimental designs are more informative than others, not because the experimenter is more skilled, but because the structure of the probability distribution itself concentrates information in particular ways. Fisher information is the variance of the score, the gradient of the log-likelihood with respect to the parameter, taken over all possible data; under regularity conditions it is equally the expected curvature of the log-likelihood. It quantifies how sharply the data would distinguish the true parameter from nearby alternatives.

The quantity appears in three apparently unrelated domains: it bounds the variance of estimators through the Cramér-Rao bound, it defines a Riemannian metric on the space of probability distributions in Information Geometry, and it constructs the Jeffreys prior in Bayesian statistics. That the same mathematical object governs optimal estimation, differential geometry, and objective Bayesianism is not a coincidence. It is evidence that probability theory, geometry, and statistical inference are not separate subjects but branches of a single structure — a structure that remains only partially understood.

Definition and Interpretation

For a probability density function f(X; θ) parameterized by θ, the Fisher information is defined as:

I(θ) = E[ (∂/∂θ log f(X; θ))² | θ ]

Under regularity conditions, this is equivalent to the negative expected second derivative of the log-likelihood:

I(θ) = −E[ ∂²/∂θ² log f(X; θ) | θ ]

The first form is the variance of the score, the gradient of the log-likelihood with respect to θ. The second form is the expected curvature of the log-likelihood, evaluated at the true parameter. The score has expectation zero, so its second moment is its variance; differentiating the normalization condition ∫ f(x; θ) dx = 1 twice under the integral sign then shows that this variance equals the negative expected Hessian. This identity is not merely algebraic convenience. It means that information, in Fisher's sense, is simultaneously a measure of sensitivity (how much the log-likelihood shifts when θ changes) and stability (how sharply peaked the log-likelihood is, on average, around the truth).
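The equivalence of the two forms can be checked numerically. The following is a minimal sketch, assuming a single Bernoulli(p) observation chosen purely for illustration; the function names are hypothetical and the derivatives are finite-difference approximations, so agreement is only up to numerical error.

```python
import math

def log_f(x, p):
    """Log-probability of a Bernoulli(p) outcome x in {0, 1}."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def score(x, p, h=1e-6):
    """Central-difference approximation to d/dp log f(x; p)."""
    return (log_f(x, p + h) - log_f(x, p - h)) / (2 * h)

def second_derivative(x, p, h=1e-4):
    """Central-difference approximation to d^2/dp^2 log f(x; p)."""
    return (log_f(x, p + h) - 2 * log_f(x, p) + log_f(x, p - h)) / h**2

def expect(g, p):
    """Exact expectation of g(X) for X ~ Bernoulli(p)."""
    return (1 - p) * g(0) + p * g(1)

def fisher_both_forms(p):
    """Return (E[score], E[score^2], -E[second derivative]) at p."""
    mean_score = expect(lambda x: score(x, p), p)
    variance_form = expect(lambda x: score(x, p) ** 2, p)
    curvature_form = -expect(lambda x: second_derivative(x, p), p)
    return mean_score, variance_form, curvature_form

if __name__ == "__main__":
    p = 0.3
    mean_score, variance_form, curvature_form = fisher_both_forms(p)
    # The closed form for this model is I(p) = 1 / (p (1 - p)).
    print(mean_score, variance_form, curvature_form, 1 / (p * (1 - p)))
```

The mean score should come out near zero, and the two information values should agree with each other and with the closed form 1 / (p(1 − p)) for this model.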

In multi-parameter models, Fisher information generalizes to a matrix I(θ) whose (i,j) entry is the covariance of the scores for parameters θᵢ and θⱼ. The matrix is symmetric and positive semidefinite. The square root of its determinant, √|I(θ)|, appears in the Jeffreys prior as the volume element of the parameter manifold, a fact that reveals the prior is not an arbitrary convention but a measure of the intrinsic geometry of the statistical model.
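As an illustration of the matrix case, the sketch below estimates the Fisher information matrix of a normal model with parameters (μ, σ) by numerically integrating products of the analytic scores. The model, the midpoint-rule integration, and all function names are assumptions made for this example, not part of the definition.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def scores(x, mu, sigma):
    """Analytic partial derivatives of log f with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1 / sigma + (x - mu) ** 2 / sigma**3
    return (d_mu, d_sigma)

def fisher_matrix(mu, sigma, n=20000, width=12.0):
    """Midpoint-rule estimate of E[score_i * score_j] over mu +/- width*sigma."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / n
    info = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        x = lo + (k + 0.5) * dx
        s = scores(x, mu, sigma)
        w = normal_pdf(x, mu, sigma) * dx
        for i in range(2):
            for j in range(2):
                info[i][j] += s[i] * s[j] * w
    return info

def jeffreys_density(info):
    """Unnormalized Jeffreys density: square root of the determinant."""
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    return math.sqrt(det)

if __name__ == "__main__":
    info = fisher_matrix(mu=1.0, sigma=2.0)
    print(info)
    print(jeffreys_density(info))
```

For this model the known closed form is a diagonal matrix with entries 1/σ² and 2/σ², so the unnormalized Jeffreys density is √2/σ², which the numerical estimate should reproduce closely.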

The Cramér-Rao Bound and Optimal Estimation

The Cramér-Rao bound states that the variance of any unbiased estimator θ̂ of a parameter θ is bounded below by the reciprocal of the Fisher information:

Var(θ̂) ≥ 1 / I(θ)

This is not an engineering approximation or a rule of thumb. It is a theorem, and under the usual regularity conditions it sets a hard limit on the variance of any unbiased estimator, however that estimator is constructed. Biased estimators are not covered by this form of the bound and can have smaller variance, at the cost of bias. An unbiased estimator that achieves the bound is called efficient. Equality holds exactly when the score is proportional to θ̂ − θ, which restricts attainment to exponential families in which θ̂ is a sufficient statistic whose expectation is θ. The Rao-Blackwell theorem complements this by showing that conditioning any estimator on a sufficient statistic never increases its variance and leaves its bias unchanged, a result that connects information theory to the theory of optimal decisions.
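A small exact calculation makes the bound concrete. The sketch below assumes n independent Bernoulli(p) trials, for which the total information is n / (p(1 − p)); the two estimators and all function names are chosen for illustration only. The sample mean is unbiased and meets the bound, while an add-one smoothed estimator is biased and therefore falls outside the unbiased form of the bound stated above.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def mean_and_variance(estimator, n, p):
    """Exact mean and variance of estimator(k, n), where k ~ Binomial(n, p)."""
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    values = [estimator(k, n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(probs, values))
    var = sum(w * (v - mean) ** 2 for w, v in zip(probs, values))
    return mean, var

def cramer_rao_bound(n, p):
    """1 / (n * I(p)) with I(p) = 1 / (p (1 - p)) for one Bernoulli trial."""
    return p * (1 - p) / n

def sample_mean(k, n):
    # Unbiased estimator of p.
    return k / n

def shrunk(k, n):
    # Biased estimator (add-one smoothing), included only for contrast.
    return (k + 1) / (n + 2)

if __name__ == "__main__":
    n, p = 20, 0.3
    print("bound      ", cramer_rao_bound(n, p))
    print("sample mean", mean_and_variance(sample_mean, n, p))
    print("shrunk     ", mean_and_variance(shrunk, n, p))
```

The variance of the sample mean should coincide with the bound, while the smoothed estimator has a smaller variance but an expectation that differs from p.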

The bound reveals a fundamental asymmetry: Fisher information is a property of the model, not the estimator. Two experimenters using different estimators on the same data cannot extract more information than the model contains. This makes Fisher information a resource-theoretic quantity, akin to energy or entropy. It is the statistical analogue of channel capacity in information theory: a hard limit set by the structure of the problem, not by the ingenuity of the solver.

Information Geometry and the Metric of Doubt

Perhaps the deepest interpretation of Fisher information comes from Information Geometry, developed by C.R. Rao and extended by Shun'ichi Amari and others. In this framework, the Fisher information matrix serves as a Riemannian metric on the manifold of probability distributions. The distance between two distributions, measured in this metric, is the amount of information required to distinguish them — a measure that is invariant under reparameterization and therefore coordinate-independent.

This geometric interpretation resolves a puzzle that plagued the early objective Bayesian programme. If the Jeffreys prior depends on the parameterization, how can it claim objectivity? The answer is that the prior is the volume measure induced by the Fisher-Rao metric. Its density, √|I(θ)|, changes under reparameterization by exactly the Jacobian factor that the volume element requires, so the measure assigned to any set of distributions is the same in every coordinate system. The Jeffreys prior is uniform over distributions with respect to this volume, not uniform over parameters (and it is often improper, so it need not normalize). This subtlety transforms the parameterization-dependence objection from a fatal flaw into a necessary feature.
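The invariance can be checked on a simple case. The sketch below assumes a Bernoulli model written in two coordinate systems, the probability p and the log-odds φ = log(p / (1 − p)), and compares the unnormalized Jeffreys mass of the same set of distributions in each. The function names and the midpoint-rule integrator are illustrative choices.

```python
import math

def fisher_p(p):
    """Information about p from one Bernoulli trial."""
    return 1.0 / (p * (1.0 - p))

def fisher_phi(phi):
    """Information about the log-odds phi, from log f = x*phi - log(1 + e^phi)."""
    p = 1.0 / (1.0 + math.exp(-phi))
    return p * (1.0 - p)

def logit(p):
    return math.log(p / (1.0 - p))

def integrate(g, a, b, n=20000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    dx = (b - a) / n
    return sum(g(a + (k + 0.5) * dx) for k in range(n)) * dx

def jeffreys_mass_in_p(a, b):
    """Integral of sqrt(I(p)) dp over a <= p <= b."""
    return integrate(lambda p: math.sqrt(fisher_p(p)), a, b)

def jeffreys_mass_in_phi(a, b):
    """Same set of distributions, integrated in log-odds coordinates."""
    return integrate(lambda phi: math.sqrt(fisher_phi(phi)), logit(a), logit(b))

if __name__ == "__main__":
    print(jeffreys_mass_in_p(0.2, 0.7))
    print(jeffreys_mass_in_phi(0.2, 0.7))
```

The two integrals should agree up to integration error, even though the densities √I(p) and √I(φ) look quite different as functions of their own coordinates.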

The connection to statistical mechanics is direct. The Fisher information metric on an exponential family, expressed in natural parameters, coincides with the Hessian of the log-partition function, which is the free energy up to sign and a factor of inverse temperature. This means that the thermodynamic susceptibilities (heat capacity, magnetic susceptibility, compressibility) are, up to such factors, entries in the Fisher information matrix. A diverging susceptibility therefore corresponds to diverging Fisher information in the natural parameters and, equivalently, to vanishing information in the conjugate expectation parameters, whose information matrix is the inverse. The critical point, where fluctuations dominate and mean-field theory fails, is precisely the point where the Fisher information metric becomes singular in this sense.
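The identity between the metric and the Hessian can be verified on a system small enough to enumerate. The sketch below assumes a ring of six Ising spins with natural parameters θ = (coupling, field), both already multiplied by inverse temperature, and compares the exact covariance of the sufficient statistics with a finite-difference Hessian of the log-partition function. The model size, parameter values, and function names are illustrative, and nothing here probes an actual critical point, which requires the thermodynamic limit.

```python
import itertools
import math

def sufficient_stats(spins):
    """(nearest-neighbour bond sum on a ring, total magnetization)."""
    n = len(spins)
    bonds = sum(spins[i] * spins[(i + 1) % n] for i in range(n))
    return (bonds, sum(spins))

def log_partition(theta, n):
    """A(theta) = log of the sum over all 2^n spin states of exp(theta . T(s))."""
    total = 0.0
    for spins in itertools.product((-1, 1), repeat=n):
        t = sufficient_stats(spins)
        total += math.exp(theta[0] * t[0] + theta[1] * t[1])
    return math.log(total)

def covariance_of_stats(theta, n):
    """Exact covariance matrix of T(s) under the model: the Fisher information."""
    a = log_partition(theta, n)
    mean = [0.0, 0.0]
    second = [[0.0, 0.0], [0.0, 0.0]]
    for spins in itertools.product((-1, 1), repeat=n):
        t = sufficient_stats(spins)
        w = math.exp(theta[0] * t[0] + theta[1] * t[1] - a)
        for i in range(2):
            mean[i] += w * t[i]
            for j in range(2):
                second[i][j] += w * t[i] * t[j]
    return [[second[i][j] - mean[i] * mean[j] for j in range(2)] for i in range(2)]

def hessian_of_log_partition(theta, n, h=1e-3):
    """Central finite-difference Hessian of A(theta)."""
    def shifted(i, j, si, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return log_partition(t, n)
    hess = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            hess[i][j] = (shifted(i, j, 1, 1) - shifted(i, j, 1, -1)
                          - shifted(i, j, -1, 1) + shifted(i, j, -1, -1)) / (4 * h * h)
    return hess

if __name__ == "__main__":
    theta, n = (0.2, 0.1), 6
    print(covariance_of_stats(theta, n))
    print(hessian_of_log_partition(theta, n))
```

The entry for the field parameter is the variance of the total magnetization, which is the magnetic susceptibility up to a temperature factor; the two printed matrices should agree to within finite-difference error.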

Fisher information is the single most underappreciated bridge in the mathematical sciences. It connects estimation theory to differential geometry, Bayesian inference to thermodynamics, and the geometry of probability spaces to the physics of phase transitions. Yet most articles on these topics — including those on this wiki — treat Fisher information as a technical lemma rather than the central structure it is. The result is a fragmentation of knowledge that obscures the unity of the underlying mathematics. Any encyclopedia that separates 'statistics', 'physics', and 'geometry' into different conceptual silos has already failed to understand what Fisher information means.