Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) is a class of algorithms for sampling from probability distributions by constructing a Markov chain that has the desired distribution as its equilibrium (stationary) distribution. The samples generated by the chain are not independent — each depends on the previous one — but under certain conditions the chain converges to the target distribution, and the samples become representative. MCMC is the computational engine of modern Bayesian inference, where the posterior distribution over parameters is often high-dimensional, analytically intractable, and impossible to sample from directly.

The central idea is deceptively simple: instead of trying to compute the target distribution directly, construct a random walk that explores the space of possible states in proportion to their probability. The walk may be inefficient — it can get stuck in local modes, mix slowly across dimensions, or take thousands of iterations to converge — but it has a crucial property: it works when almost nothing else does. The rise of MCMC in the 1990s transformed Bayesian statistics from a philosophical program into a practical methodology, because for the first time statisticians could compute posterior distributions for models of realistic complexity.

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm, developed by Nicholas Metropolis and colleagues in 1953 and generalized by W. K. Hastings in 1970, is the foundational MCMC method. Given a target distribution π(x) (known up to a normalizing constant) and a proposal distribution q(x'|x), the algorithm iterates:

  • Propose a new state x' from the current state x using q
  • Compute the acceptance probability α = min(1, π(x')q(x|x') / π(x)q(x'|x))
  • Accept x' with probability α; otherwise remain at x

The remarkable fact is that this procedure, repeated, produces a Markov chain whose stationary distribution is exactly π(x). The proposal can be almost anything: a random walk, an independent draw from a fixed distribution, even a poorly chosen one. As long as the proposal lets the chain reach every region where π has positive probability (so that the chain is irreducible), the chain will still converge to the right target, though the rate of convergence depends heavily on the quality of the proposal.
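
To make the iteration concrete, here is a minimal random-walk Metropolis sampler in Python. This is a sketch, not a reference implementation: the function names, the Gaussian proposal, and the step size are illustrative choices. Because the Gaussian random walk is symmetric, q(x'|x) = q(x|x') and the q terms cancel from the acceptance ratio.

    import numpy as np

    def metropolis_hastings(log_target, x0, n_samples, step=0.5, seed=0):
        """Random-walk Metropolis sampler.

        log_target: log of the unnormalized target density log pi(x)
        x0:         starting state (1-D array)
        step:       standard deviation of the Gaussian proposal
        """
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        log_p = log_target(x)
        samples = np.empty((n_samples, x.size))
        for i in range(n_samples):
            # Symmetric proposal: q(x'|x) = q(x|x'), so q cancels in alpha.
            x_new = x + step * rng.standard_normal(x.size)
            log_p_new = log_target(x_new)
            # Accept with probability min(1, pi(x')/pi(x)); compare in logs
            # for numerical stability.
            if np.log(rng.uniform()) < log_p_new - log_p:
                x, log_p = x_new, log_p_new
            samples[i] = x  # on rejection, the current state is recorded again
        return samples

    # Example: sample a standard 2-D Gaussian from its unnormalized log density.
    draws = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 5000)

Note that only the unnormalized log density is ever evaluated, which anticipates the point below about normalizing constants.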

The algorithm does not require knowing the normalizing constant of π, which is the property that makes MCMC applicable to Bayesian inference. The posterior distribution p(θ|data) is proportional to p(data|θ)p(θ), and the normalizing constant p(data) — the integral of the likelihood over all parameters — is typically intractable. Metropolis-Hastings only requires ratios of π, so the normalizing constants cancel. This is why MCMC and Bayesian inference are coupled: the problem that makes Bayesian inference hard (computing the marginal likelihood) is the problem that MCMC sidesteps.
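
Written out explicitly, this is just the acceptance ratio above with π(θ) = p(data|θ)p(θ)/p(data) substituted in:

  α = min(1, [p(data|θ') p(θ') q(θ|θ')] / [p(data|θ) p(θ) q(θ'|θ)])

The factor p(data) appears in both the numerator and the denominator of the ratio π(θ')/π(θ), so it divides out before anything is computed: the sampler never evaluates the marginal likelihood at all.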

Gibbs Sampling and the Gibbs-Markov Connection

Gibbs sampling, introduced by Stuart and Donald Geman in 1984, is a special case of Metropolis-Hastings where the proposal is the conditional distribution of each variable given all the others. The algorithm cycles through variables, sampling each from its conditional posterior while holding the others fixed. The acceptance probability is always 1 — every proposed move is accepted — which makes Gibbs sampling simple to implement when the conditional distributions are tractable.
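
As a concrete toy case, here is a Gibbs sampler in Python for a standard bivariate normal with correlation ρ, where both full conditionals are available in closed form. This is a sketch; the function name and the choice of ρ are illustrative.

    import numpy as np

    def gibbs_bivariate_normal(rho, n_samples, seed=0):
        """Gibbs sampler for a standard bivariate normal with correlation rho.

        Both full conditionals are exact univariate normals:
        x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x,
        so every update is an exact draw and every move is accepted.
        """
        rng = np.random.default_rng(seed)
        x, y = 0.0, 0.0
        cond_sd = np.sqrt(1.0 - rho**2)       # sd of each conditional
        samples = np.empty((n_samples, 2))
        for i in range(n_samples):
            x = rng.normal(rho * y, cond_sd)  # draw x | y
            y = rng.normal(rho * x, cond_sd)  # draw y | x
            samples[i] = (x, y)
        return samples

    # With rho near 1 the conditionals are nearly degenerate and each sweep
    # moves only slightly in the joint space: the slow mixing discussed below.
    draws = gibbs_bivariate_normal(rho=0.99, n_samples=10_000)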

The connection to Markov chains is explicit: Gibbs sampling constructs a Markov chain in the space of all variables whose transition kernel is the product of the conditional distributions. The chain is guaranteed to converge to the joint posterior if the joint distribution satisfies a mild condition called connectedness — essentially, that the variables are not partitioned into disconnected groups. The convergence is not necessarily fast. Variables that are highly correlated in the posterior may cause the Gibbs chain to mix very slowly, because changing one variable while holding the others fixed produces only small steps in the joint space.

Convergence, Diagnostics, and the Problem of Knowing When to Stop

The deepest practical problem in MCMC is determining whether the chain has converged. The theory guarantees convergence in the limit — as the number of iterations approaches infinity, the distribution of samples approaches the target — but it provides almost no guidance on how many iterations are needed for a finite run. In practice, MCMC users rely on convergence diagnostics: the Gelman-Rubin statistic, which compares the variance within and between multiple chains; effective sample size, which measures how many independent samples the correlated MCMC draws are equivalent to; and visual inspection of trace plots for evidence of non-stationarity.
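
For illustration, here are minimal NumPy versions of two of these diagnostics, written directly from the standard formulas. These are sketches: in particular, the truncation rule in the effective-sample-size estimate is one common heuristic among several.

    import numpy as np

    def gelman_rubin(chains):
        """Potential scale reduction R-hat for an (m, n) array of m chains."""
        m, n = chains.shape
        means = chains.mean(axis=1)
        B = n * means.var(ddof=1)              # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
        var_hat = (n - 1) / n * W + B / n      # pooled variance estimate
        return np.sqrt(var_hat / W)            # approaches 1.0 as chains agree

    def effective_sample_size(chain):
        """ESS = n / (1 + 2 * sum of positive-lag autocorrelations)."""
        x = chain - chain.mean()
        n = x.size
        acov = np.correlate(x, x, mode="full")[n - 1:] / n
        rho = acov / acov[0]
        # Truncate at the first negative autocorrelation (a common heuristic).
        cut = np.argmax(rho < 0) if np.any(rho < 0) else n
        return n / (1.0 + 2.0 * rho[1:cut].sum())

A value of R-hat well above 1 indicates the chains have not yet mixed into the same distribution; an effective sample size far below the number of draws indicates strong autocorrelation.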

These diagnostics are necessary but insufficient. A chain can pass all diagnostic tests and still be far from convergence if the target distribution has a mode that the chain has not yet discovered. The problem is fundamentally one of exploration: MCMC is a local algorithm (each step depends only on the current state) trying to explore a potentially global structure (the target distribution). There is no general solution to this problem. The only guarantee is asymptotic, and the asymptote may be unreachable in practice.

Applications and the Emergence of Approximate Inference

MCMC has been applied across virtually every quantitative discipline: phylogenetics (where it samples tree structures), population genetics (where it samples demographic parameters), natural language processing (where it samples parse trees and topic models), statistical physics (where it samples spin configurations), and machine learning (where it samples neural network weights and hyperparameters). In each domain, the same pattern recurs: a model is specified, its posterior is intractable, and MCMC provides approximate samples that enable inference, prediction, and model comparison.

The rise of variational inference — an alternative approximate method that optimizes a simpler distribution to match the target — has partly displaced MCMC in large-scale machine learning, because variational methods are typically faster and easier to scale. But the displacement is not total. Variational inference optimizes a lower bound on the marginal likelihood, and the bound may be loose; MCMC, despite its slowness, is asymptotically exact. The choice between MCMC and variational methods is a trade-off between computational cost and approximation error, and the right choice depends on the application.

The deeper significance of MCMC is not merely as a computational tool but as a demonstration of a general principle: complex distributions can be explored by simple local rules, provided the rules are designed to respect the target structure. The Markov chain is a self-organizing system that finds its equilibrium without global coordination. The convergence of the chain is an emergent property of the local transition rules, not a designed feature of the algorithm. This is also why MCMC has such deep roots in statistical physics: the Metropolis algorithm was created there as a tool for studying phase transitions and critical phenomena, long before statisticians adopted it.

The triumph of MCMC is not that it solves Bayesian inference perfectly. It is that it solves it at all — that a simple random walk, designed with no knowledge of the global structure of the target, can explore distributions of arbitrary complexity and produce samples that are, in the limit, indistinguishable from the truth. The limit may be unreachable, the convergence may be slow, and the diagnostics may be imperfect. But the principle stands: local rules, consistently applied, generate global structure. This is emergence in its computational form.