Laplace approximation

The Laplace approximation is a method for approximating integrals that arise in Bayesian statistics and related fields. In its most common form, it approximates a posterior distribution by a Gaussian centered at the mode of the posterior, with a covariance matrix equal to the inverse of the Hessian of the negative log-posterior at that mode. The method replaces intractable high-dimensional integrals with tractable Gaussian integrals, making it a workhorse of computational Bayesian inference.

The approximation dates to the work of Pierre-Simon Laplace in the 18th century, but it came into wide statistical use only with the revival of Bayesian methods in the late 20th century. It is now a common first approximation in Bayesian model comparison because it provides a closed-form estimate of the marginal likelihood, the quantity that underlies the Bayes factor.

The Gaussian Approximation

Given a posterior distribution p(θ | D) ∝ p(D | θ) p(θ), the Laplace approximation finds the mode θ̂ (the maximum a posteriori estimate) and expands the log-posterior around this point using a second-order Taylor series:

log p(θ | D) ≈ log p(θ̂ | D) - 1/2 (θ - θ̂)^T H (θ - θ̂)

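where H is the Hessian matrix of the negative log-posterior evaluated at θ̂. The first-order term is absent because the gradient of the log-posterior vanishes at the mode. Exponentiating this approximation yields a Gaussian density with mean θ̂ and covariance H^{-1}. Applying the same expansion to the unnormalized posterior p(D | θ) p(θ) and integrating over θ gives the approximate marginal likelihood: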

p(D) ≈ p(D | θ̂) p(θ̂) (2π)^{d/2} |H|^{-1/2}

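where d is the dimension of the parameter space. This formula exhibits the automatic Occam's razor effect. The factor (2π)^{d/2} |H|^{-1/2} measures the volume of the posterior peak, and multiplying it by the prior density p(θ̂) gives roughly the fraction of prior mass that remains plausible after seeing the data. A model with more parameters, or with a more diffuse prior, typically concentrates a smaller fraction of its prior on the region that fits, so its marginal likelihood shrinks unless the improvement in the fit p(D | θ̂) compensates.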
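As a concrete check, the marginal-likelihood formula above can be evaluated on a model where the exact answer is known. The sketch below assumes a Bernoulli likelihood with a Beta(a, b) prior, an illustrative choice that does not come from the text; the function names and the inputs k = 32, n = 50, a = b = 2 are likewise hypothetical. Because d = 1 here, H reduces to a scalar h.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def laplace_log_evidence(k, n, a, b):
    """Laplace estimate of log p(D) for k successes in n Bernoulli trials
    under a Beta(a, b) prior. Returns (theta_hat, h, log_evidence)."""
    alpha = a + k - 1.0        # exponent of theta in the unnormalized posterior
    beta = b + n - k - 1.0     # exponent of (1 - theta)
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("mode lies on the boundary; no interior expansion point")
    theta_hat = alpha / (alpha + beta)  # posterior mode (MAP estimate)
    log_lik = k * math.log(theta_hat) + (n - k) * math.log(1.0 - theta_hat)
    log_prior = ((a - 1.0) * math.log(theta_hat)
                 + (b - 1.0) * math.log(1.0 - theta_hat)
                 - log_beta(a, b))
    # Second derivative of the negative log-posterior at the mode.
    h = alpha / theta_hat ** 2 + beta / (1.0 - theta_hat) ** 2
    # log of p(D | theta_hat) p(theta_hat) (2 pi)^{d/2} |H|^{-1/2} with d = 1.
    log_evidence = (log_lik + log_prior
                    + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(h))
    return theta_hat, h, log_evidence


def exact_log_evidence(k, n, a, b):
    """Exact log p(D) from Beta-Bernoulli conjugacy, used as ground truth."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)


theta_hat, h, approx = laplace_log_evidence(k=32, n=50, a=2.0, b=2.0)
exact = exact_log_evidence(k=32, n=50, a=2.0, b=2.0)
print(f"mode = {theta_hat:.4f}, approximate posterior sd = {h ** -0.5:.4f}")
print(f"log p(D): Laplace = {approx:.4f}, exact = {exact:.4f}")
```

With these inputs the two log-evidence values should agree to within a few hundredths of a nat. The boundary check matters in practice: when the mode sits at 0 or 1 there is no interior point around which to expand, and the formula does not apply.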

Connection to Information Criteria

In the limit of large sample sizes, the Laplace approximation to the log marginal likelihood reduces to the Bayesian information criterion (BIC). For n independent observations the Hessian grows in proportion to n, so log |H| = d log n + O(1). The BIC keeps only the terms that grow with n, drops the prior and the other O(1) contributions, and multiplies the result by -2, producing the familiar score:

BIC = -2 log p(D | θ̂) + d log n

This derivation reveals that BIC is not an arbitrary penalty but a large-sample approximation to -2 log p(D), accurate up to terms that do not grow with n. In BIC, θ̂ is conventionally the maximum likelihood estimate, which agrees with the posterior mode to leading order when n is large. The approximation is valid when the posterior is well-approximated by a Gaussian and the sample size is large relative to the number of parameters. In small samples, the full Laplace approximation, which retains the prior and the Hessian structure, is typically more accurate than BIC.
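The relationship between the two scores can be illustrated numerically. The sketch below again assumes a Bernoulli model with a Beta(2, 2) prior and a fixed success rate of 0.6, all hypothetical choices made only so that the exact marginal likelihood is available for comparison. It puts BIC, the full Laplace approximation, and the exact value on the same -2 log p(D) scale.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def neg2_log_evidence_exact(k, n, a, b):
    """Exact -2 log p(D) for k successes in n trials, Beta(a, b) prior."""
    return -2.0 * (log_beta(a + k, b + n - k) - log_beta(a, b))


def neg2_log_evidence_laplace(k, n, a, b):
    """-2 times the Laplace estimate of log p(D); assumes an interior mode."""
    alpha, beta = a + k - 1.0, b + n - k - 1.0
    t = alpha / (alpha + beta)                      # posterior mode
    # log p(D | t) + log p(t), written with the combined exponents.
    log_joint = alpha * math.log(t) + beta * math.log(1.0 - t) - log_beta(a, b)
    h = alpha / t ** 2 + beta / (1.0 - t) ** 2      # scalar Hessian, d = 1
    return -2.0 * (log_joint + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(h))


def bic(k, n):
    """BIC = -2 log p(D | theta_mle) + d log n with d = 1; needs 0 < k < n."""
    t = k / n                                       # maximum likelihood estimate
    log_lik = k * math.log(t) + (n - k) * math.log(1.0 - t)
    return -2.0 * log_lik + 1 * math.log(n)


a, b = 2.0, 2.0
for n in (10, 100, 1000, 100000):
    k = (6 * n) // 10                               # success rate fixed at 0.6
    exact = neg2_log_evidence_exact(k, n, a, b)
    lap_err = neg2_log_evidence_laplace(k, n, a, b) - exact
    bic_err = bic(k, n) - exact
    print(f"n={n:>6}  Laplace error={lap_err:+.4f}  BIC error={bic_err:+.4f}")
```

In this setup the Laplace error should shrink toward zero as n grows, while the BIC error should settle at a constant offset: the O(1) prior and curvature terms that BIC discards. That constant is negligible relative to scores that grow with n, but it can matter when n is small.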

The Laplace approximation also connects to the minimum description length (MDL) framework. The term (1/2) log |H| reflects the precision to which the parameters must be specified, and hence their coding cost, while the negative log marginal likelihood -log p(D) can be interpreted as the total description length of the data using the model. Both frameworks, Bayesian and information-theoretic, converge on the same Gaussian approximation, suggesting that the Laplace form is capturing something fundamental about how high-dimensional models compress data.

Limitations and Extensions

The Laplace approximation fails when the posterior is multimodal, heavily skewed, or constrained to a non-Euclidean manifold. In complex systems such as neural networks, agent-based models, and hierarchical Bayesian models, the posterior landscape is often rugged, and the Gaussian assumption around a single mode can be catastrophically wrong. The approximation also requires that the Hessian H be positive definite. That is not guaranteed for models with non-identified parameters or flat directions in the posterior, where H is singular and |H|^{-1/2} diverges.
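The multimodal failure is easy to reproduce in one dimension. The sketch below uses a hypothetical unnormalized density with two equal, well-separated Gaussian bumps at -3 and +3, chosen only because its true integral, 2 sqrt(2π), is known. The mode finder is a simple Newton iteration on finite-difference derivatives, an assumption made to keep the example short rather than a recommended optimizer.

```python
import math


def log_f(theta):
    """Log of an unnormalized bimodal density: unit-variance bumps at -3 and +3."""
    a = -0.5 * (theta - 3.0) ** 2
    b = -0.5 * (theta + 3.0) ** 2
    m = max(a, b)                                   # log-sum-exp for stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def second_derivative(fn, x, eps=1e-4):
    """Central finite-difference estimate of fn''(x)."""
    return (fn(x + eps) - 2.0 * fn(x) + fn(x - eps)) / eps ** 2


def find_mode(log_density, start, eps=1e-4, max_iter=50):
    """Newton iteration for a local maximum of log_density near `start`."""
    theta = start
    for _ in range(max_iter):
        grad = (log_density(theta + eps) - log_density(theta - eps)) / (2.0 * eps)
        curv = second_derivative(log_density, theta, eps)
        if curv >= 0.0:
            raise ValueError("log-density is not locally concave here")
        step = grad / curv
        theta -= step
        if abs(step) < 1e-10:
            break
    return theta


def laplace_integral(log_density, start):
    """Laplace estimate of the integral of exp(log_density) around one mode."""
    theta_hat = find_mode(log_density, start)
    h = -second_derivative(log_density, theta_hat)  # curvature of negative log-density
    return theta_hat, math.exp(log_density(theta_hat)) * math.sqrt(2.0 * math.pi / h)


theta_hat, estimate = laplace_integral(log_f, start=2.5)
true_value = 2.0 * math.sqrt(2.0 * math.pi)
print(f"mode found at {theta_hat:.4f}")
print(f"Laplace estimate / true integral = {estimate / true_value:.4f}")
```

Started near either bump, the expansion sees only that bump, so the estimate should recover about half of the true normalizing constant. Nothing in the local curvature signals the problem, which is why restarting the optimizer from several initial points is a common safeguard.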

When the Laplace approximation fails, practitioners turn to variational inference (which optimizes over a family of simpler distributions) or sampling methods (which avoid parametric assumptions about the posterior's shape). These methods are typically more expensive, however, and the Laplace approximation remains a common first approach for screening models before committing to more intensive computation.

The Laplace approximation is often dismissed as a crude first step — a Gaussian band-aid applied to a messy posterior. But this dismissal misses the deeper point: the approximation succeeds precisely when the posterior has concentrated around a single coherent solution, and it fails precisely when the model is underspecified or the data are ambiguous. In this sense, the Laplace approximation is not merely a computational convenience; it is a diagnostic. A posterior that cannot be Laplace-approximated is a posterior that has not yet made up its mind. The approximation's failure is more informative than its success. A Bayesian who refuses to check whether the Laplace approximation holds before running an expensive MCMC sampler is not being rigorous — they are being computationally lazy, substituting runtime for epistemic discipline.