Information theory
Information theory is the mathematical study of the quantification, storage, and communication of information. Founded by Claude Shannon's landmark 1948 paper A Mathematical Theory of Communication, it provides the formal language in which the fundamental limits of all communication systems — digital, biological, and otherwise — can be precisely stated. Shannon's core insight was that information can be defined independently of meaning: what matters for communication engineering is not what a message says, but how much uncertainty it resolves.
The field has since expanded far beyond telecommunications, becoming a foundational framework for statistical mechanics, computational complexity, machine learning, genetics, and neuroscience. Information-theoretic limits appear wherever there is noise, compression, or inference — which is everywhere in the physical and computational world.
Shannon Entropy: Uncertainty as Information
The central quantity of information theory is Shannon entropy, denoted H. For a discrete probability distribution over outcomes x₁, ..., xₙ with probabilities p₁, ..., pₙ, the entropy is:
H(X) = -Σ pᵢ log₂(pᵢ)
This quantity measures the average uncertainty about the outcome of a random variable — equivalently, the average number of bits required to communicate the outcome of X to a receiver who knows the distribution but not the specific result. A fair coin has entropy 1 bit. A loaded coin that always comes up heads has entropy 0 bits — no message is needed because there is no uncertainty.
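The coin examples above can be checked directly. The helper below is an illustrative sketch (the function name and the zero-probability convention are choices made here, not part of the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p = 0 terms
    (by convention, 0 * log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin → 1.0
print(entropy([1.0, 0.0]))   # loaded coin: zero uncertainty, zero bits
print(entropy([0.25] * 4))   # uniform over 4 outcomes → 2.0
```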
The elegance of Shannon entropy is that it is, up to a multiplicative constant (the choice of logarithm base), the unique function satisfying three intuitively necessary axioms: continuity (small changes in probability produce small changes in entropy), symmetry (the order in which outcomes are listed does not matter), and recursion (the entropy of a composite experiment equals the entropy of the first stage plus the conditional entropy of the second stage given the first). These axioms determine the logarithmic form: the formula is not a choice but a theorem.
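The recursion axiom can be verified numerically. The decomposition below follows a standard worked example (the distribution (1/2, 1/3, 1/6) appears in Shannon's 1948 paper); the entropy helper is a small sketch assumed for the check:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Recursion: drawing from {a: 1/2, b: 1/3, c: 1/6} in one step has the same
# entropy as first choosing "a vs. {b, c}" (probabilities 1/2, 1/2), then,
# half the time, choosing b vs. c within {b, c} (probabilities 2/3, 1/3).
direct = entropy([1/2, 1/3, 1/6])
staged = entropy([1/2, 1/2]) + (1/2) * entropy([2/3, 1/3])
print(math.isclose(direct, staged))  # → True
```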
Channel Capacity and the Fundamental Limits
Shannon's channel coding theorem establishes the channel capacity C as the maximum rate at which information can be transmitted over a noisy channel with arbitrarily small error probability. For a channel with noise, the capacity is:
C = max I(X; Y)
where the maximum is taken over all input distributions, and I(X; Y) is the mutual information between channel input X and channel output Y.
The theorem's implications are counterintuitive: no matter how noisy the channel, there exists a coding scheme that achieves transmission rates arbitrarily close to C with arbitrarily small error. But for any rate above C, the error probability is bounded away from zero regardless of the coding scheme. This is a hard limit set by mathematics, not engineering. Better hardware can push you closer to the limit; no hardware can cross it.
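For one concrete channel, the maximization has a closed form. The binary symmetric channel flips each transmitted bit independently with crossover probability p, and a standard result (not stated in the text above) is that the uniform input distribution achieves the maximum, giving C = 1 - H(p). A minimal sketch:

```python
import math

def h2(p):
    """Binary entropy function H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p.
    The maximizing input distribution is uniform, so C = 1 - H(p)."""
    return 1.0 - h2(p)

print(bsc_capacity(0.0))              # → 1.0 (noiseless: one bit per use)
print(bsc_capacity(0.5))              # → 0.0 (output independent of input)
print(round(bsc_capacity(0.1), 3))    # → 0.531 (10% bit flips)
```

Note that even at a 10% flip rate, over half a bit per channel use survives, provided the coding is good enough.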
This result transformed telecommunications engineering. Before Shannon, engineers believed that combating noise required sacrificing transmission rate, as if reliability and speed traded off against each other without limit. Shannon showed they do not. With correct coding, the tradeoff disappears: up to capacity, you can have both speed and reliability. The insight liberated the field: the right problem was not to reduce noise but to find optimal codes.
The Connection to Physics
The relationship between Shannon entropy and thermodynamic entropy is more than analogical. Boltzmann's entropy formula S = k log W defines thermodynamic entropy as the logarithm of the number of microstates compatible with a macrostate. Shannon entropy is, per symbol and in the long-sequence limit, the logarithm of the number of typical sequences a source produces. Both measure, in different units and with different constants, the same underlying quantity: the logarithm of the size of the set of possibilities consistent with what is known.
The physicist Leo Szilard showed in 1929, before Shannon, that the acquisition of information about the state of a physical system is thermodynamically significant: acquiring one bit of information is associated with an entropy reduction of k ln 2. Rolf Landauer sharpened the connection in 1961 by showing that the erasure of one bit of stored information necessarily dissipates at least kT ln 2 of energy as heat. This result, known as Landauer's principle, connects information theory to the Second Law of Thermodynamics and implies that computation has an irreducible thermodynamic cost: not the act of computation itself, but the erasure of memory.
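The minimum dissipation per erased bit is tiny but concrete. A quick calculation (using the standard value of Boltzmann's constant and an assumed room temperature of 300 K):

```python
import math

# Landauer bound: erasing one bit at temperature T dissipates
# at least k_B * T * ln 2 of heat.
k_B = 1.380649e-23   # Boltzmann constant, J/K (exact, 2019 SI definition)
T = 300.0            # assumed room temperature, K

E_min = k_B * T * math.log(2)
print(f"{E_min:.3e} J per bit erased")   # ≈ 2.871e-21 J
```

This figure, roughly 3 zeptojoules per bit, is the experimentally verified floor mentioned below; real computers dissipate many orders of magnitude more.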
The deep implication is that information is physical. It is not an abstract quantity floating free of matter. Every bit stored, transmitted, or erased has a physical substrate and a thermodynamic footprint. This is not merely a philosophical claim — it makes testable predictions about the minimum energy cost of computation that have been experimentally verified.
Mutual Information, Channels, and Inference
Mutual information I(X; Y) measures the amount of information that one random variable carries about another:
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
It is symmetric: X tells us as much about Y as Y tells us about X. This symmetry is not obvious from the causal picture — if X causes Y, one might expect X to tell us more about Y than vice versa — but information theory is not a causal calculus. It measures statistical dependency, not causation.
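The symmetry is easy to check numerically via the equivalent identity I(X; Y) = H(X) + H(Y) - H(X, Y), which is manifestly symmetric in X and Y (it follows from H(X|Y) = H(X, Y) - H(Y)). The joint distribution below is invented for illustration, not taken from the text:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y): rows index X, columns index Y.
joint = [[0.3, 0.1],
         [0.1, 0.5]]

px = [sum(row) for row in joint]                       # marginal of X
py = [sum(col) for col in zip(*joint)]                 # marginal of Y
h_joint = entropy([p for row in joint for p in row])   # H(X, Y)

# I(X; Y) = H(X) + H(Y) - H(X, Y); symmetric by inspection.
mi = entropy(px) + entropy(py) - h_joint
print(round(mi, 4))   # ≈ 0.2564 bits of shared information
```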
The application to Bayesian inference is direct. Given observed data Y, the mutual information I(X; Y) measures how much the data reduces our uncertainty about the hypothesis X. A good experiment is one with high mutual information between experimental outcomes and hypotheses of interest. The Kullback-Leibler (KL) divergence, a non-symmetric relative of mutual information (I(X; Y) is exactly the KL divergence between the joint distribution and the product of its marginals), measures how much a probability distribution P differs from a reference distribution Q:
D_KL(P || Q) = Σ pᵢ log(pᵢ/qᵢ)
KL divergence is the information lost when Q is used to approximate P — it appears throughout Bayesian statistics, variational inference, and predictive coding models of neural computation.
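The non-symmetry is worth seeing concretely. The distributions below are made up for illustration, and the function assumes Q assigns positive probability wherever P does:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]   # uniform reference distribution

print(round(kl_divergence(p, q), 4))   # extra bits paid coding P with Q's code
print(round(kl_divergence(q, p), 4))   # a different number: D_KL is not symmetric
```

The two directions answer different questions, which is why the order of arguments matters throughout variational inference.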
Algorithmic Information Theory
Shannon information is a property of probability distributions. Algorithmic information theory — developed independently by Kolmogorov, Solomonoff, and Chaitin in the 1960s — defines information as a property of individual objects. The Kolmogorov complexity K(x) of a string x is the length of the shortest program that produces x. A string is random if its shortest program is approximately as long as the string itself — no compression is possible. A string is structured if it has a compact description.
This definition captures intuitive notions of randomness and pattern in a way that probability-theoretic definitions cannot. The string 0101010101... has low Kolmogorov complexity (short description: 'print 01 fifty times'), yet under a uniform distribution over strings of its length it is exactly as probable as any random-looking string, so Shannon's framework assigns it nothing special. Algorithmic information theory disentangles these notions: entropy measures unpredictability over a distribution; complexity measures the intrinsic descriptive content of individual strings.
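Although K(x) itself cannot be computed, the length of a compressed string is a computable upper-bound proxy for descriptive content, and it separates patterned from random-looking strings cleanly. A sketch using zlib as the compressor (the specific strings and thresholds are illustrative choices):

```python
import random
import zlib

patterned = b"01" * 500                # 1000 bytes with a short description
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))   # 1000 random bytes

# Compressed length stands in (crudely) for Kolmogorov complexity:
# the patterned string collapses; the random one barely shrinks at all.
print(len(zlib.compress(patterned)))   # small: the repetition is found
print(len(zlib.compress(noisy)))       # near 1000: essentially incompressible
```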
The limitation is computational: Kolmogorov complexity is not computable. There is no algorithm that, given a string x, outputs K(x) for all x. This is not a practical limitation but a fundamental one: Chaitin's proof that K is uncomputable is closely related to the halting problem and to Gödel's incompleteness theorems. The most fundamental measure of information content is beyond the reach of any algorithm.
Information Theory Across Disciplines
Information theory has colonized fields that did not invent it, often productively. In molecular biology, the genetic code is an information channel: sequences over a four-letter nucleotide alphabet encode sequences over a twenty-amino-acid alphabet plus stop signals, and the channel capacity of the genetic code can be calculated and compared to the actual information content of protein-coding sequences. In neuroscience, neural populations have been analyzed as channels transmitting information about stimuli, and the metabolic cost of neural coding has been linked to thermodynamic information costs. In ecology, mutual information between species abundances has been used to infer food web structure without direct observation of feeding relationships.
In each case, information theory provides a language for precision — for distinguishing signal from noise, for quantifying what is and is not being communicated — that the native vocabulary of the field could not supply. This cross-disciplinary utility is not free: importing information-theoretic concepts often imports their assumptions, including the assumption that the relevant process can be modeled as a channel with a fixed noise structure. In systems where the noise structure itself evolves — in co-evolutionary arms races, in adaptive immune systems, in financial markets — the fixed-channel model is an idealization whose costs must be paid in interpretive care.
The deepest achievement of information theory is not the formula for channel capacity but the demonstration that the concept of information can be given a rigorous mathematical form — that 'how much information' is a question with a definite answer independent of what the information is about. Whether this formalization captures everything we care about when we speak of information, knowledge, and meaning is a question the formalism itself is not equipped to answer.