Variational Inference

Variational inference is a framework for approximate probabilistic inference that replaces integration (computing a posterior distribution by summing or integrating over all possible parameter values) with optimization. The core idea is to select a simpler family of distributions, then find the member of that family that is closest to the true posterior according to some measure of dissimilarity, typically the Kullback-Leibler divergence. The result is not the true posterior but a tractable approximation, usually obtained at a computational cost far below that of exact integration, which can grow exponentially with the number of latent variables.

The framework is central to modern machine learning because exact Bayesian inference is intractable for all but the simplest models. In Bayesian neural networks, for instance, the posterior over millions of parameters cannot be computed in closed form. Variational inference provides a principled way to obtain uncertainty estimates without solving the full inference problem.

The Variational Objective

The standard variational objective is the evidence lower bound (ELBO). For a model with data D, latent variables Z, and parameters θ, the log-evidence log p(D) requires integrating over Z and θ and is generally intractable. The ELBO bounds it from below by introducing a variational distribution q(Z,θ) and decomposing:

log p(D) = ELBO(q) + KL(q(Z,θ) || p(Z,θ | D))

where ELBO(q) = E_q[log p(D,Z,θ) - log q(Z,θ)] and the second term is the Kullback-Leibler divergence from the variational approximation to the true posterior. Because the KL divergence is non-negative, ELBO(q) ≤ log p(D), which is what makes it a lower bound. Because log p(D) does not depend on q, maximizing the ELBO is equivalent to minimizing the KL divergence: q is made as close as possible to the true posterior while remaining computationally tractable.
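The decomposition above can be checked numerically on a model small enough that the evidence is computable by direct summation. The following is a minimal sketch, assuming a hypothetical toy model with a single three-state latent variable; the probabilities and the names `prior`, `likelihood`, `elbo`, and `kl` are made up for illustration.

```python
import numpy as np

# Hypothetical toy model: one latent variable Z with three states and a
# single fixed observation D. All numbers are invented for illustration.
prior = np.array([0.5, 0.3, 0.2])        # p(Z)
likelihood = np.array([0.9, 0.2, 0.1])   # p(D | Z) for the observed D

joint = prior * likelihood               # p(D, Z)
evidence = joint.sum()                   # p(D); summable only because Z is tiny
posterior = joint / evidence             # p(Z | D)


def elbo(q):
    """ELBO(q) = E_q[log p(D, Z) - log q(Z)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))


def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational guess

# The gap between the log-evidence and the ELBO should equal KL(q || posterior).
gap = float(np.log(evidence)) - elbo(q)
print("gap:", gap, "KL(q || posterior):", kl(q, posterior))
```

Setting q equal to the exact posterior closes the gap, which is the sense in which maximizing the ELBO drives q toward the posterior.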

This objective has a revealing information-theoretic interpretation. The ELBO can be rewritten as ELBO(q) = E_q[log p(D | Z,θ)] - KL(q(Z,θ) || p(Z,θ)): the expected log-likelihood (how well the model explains the data under the approximation) minus the KL divergence from the approximation to the prior (a regularization term penalizing approximations that stray too far from the prior). The balance between fit and regularization is not arbitrary; it follows directly from Bayes' rule, with no free weight between the two terms.
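The split into a fit term and a regularization term can be made concrete in a conjugate model where every quantity has a closed form. The sketch below assumes a hypothetical model with prior θ ~ N(0, 1), likelihood xᵢ | θ ~ N(θ, 1), and a Gaussian variational family q(θ) = N(m, s2); the data values and function names are illustrative assumptions, not part of any library.

```python
import numpy as np

# Made-up observations for a hypothetical model:
#   prior       theta ~ N(0, 1)
#   likelihood  x_i | theta ~ N(theta, 1)
x = np.array([0.8, 1.3, 0.4, 1.9, 1.1])
n = x.size


def expected_log_likelihood(m, s2):
    """E_q[sum_i log N(x_i | theta, 1)] under q(theta) = N(m, s2)."""
    return float(-0.5 * n * np.log(2 * np.pi)
                 - 0.5 * np.sum((x - m) ** 2 + s2))


def kl_to_prior(m, s2):
    """KL(N(m, s2) || N(0, 1)) in closed form."""
    return float(0.5 * (s2 + m ** 2 - 1.0 - np.log(s2)))


def elbo(m, s2):
    """Fit term minus regularization term."""
    return expected_log_likelihood(m, s2) - kl_to_prior(m, s2)


# Exact posterior for this conjugate model: N(sum(x) / (n + 1), 1 / (n + 1)).
post_var = 1.0 / (1.0 + n)
post_mean = float(x.sum() * post_var)

# Exact log-evidence: marginally, x ~ N(0, I + 1 1^T).
log_evidence = float(-0.5 * n * np.log(2 * np.pi)
                     - 0.5 * np.log(1.0 + n)
                     - 0.5 * (np.sum(x ** 2) - x.sum() ** 2 / (1.0 + n)))

print("ELBO at exact posterior:", elbo(post_mean, post_var))
print("log-evidence:           ", log_evidence)
print("ELBO at the prior:      ", elbo(0.0, 1.0))
```

At the prior the KL term vanishes but the fit term is poor; at the exact posterior the two terms balance so that the ELBO reaches the log-evidence.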

Mean-Field and Structured Approximations

The simplest variational family is the mean-field approximation, which assumes all latent variables are independent: q(Z) = ∏ᵢ qᵢ(Zᵢ). This assumption makes optimization tractable, since each factor can be updated in turn while the others are held fixed, but it is often too restrictive. In models with strong dependencies between variables, the mean-field approximation can miss critical structure. Because minimizing KL(q || p) favors approximations that avoid regions where the posterior has little mass, the result tends to underestimate posterior variance, producing overconfident or systematically biased posteriors.
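The overconfidence of mean-field can be seen in a small example where the target is itself a correlated Gaussian, so the optimal factors are known in closed form. This is a minimal sketch of coordinate-ascent updates under that assumption; the target mean `mu`, covariance `Sigma`, and the number of sweeps are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical target "posterior": a strongly correlated bivariate Gaussian.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)               # precision matrix

# Mean-field family: q(z) = q1(z1) q2(z2). For a Gaussian target, each optimal
# factor is Gaussian with variance 1 / Lam[i, i] and a mean that depends on the
# current mean of the other factor.
m = np.zeros(2)
for _ in range(200):                     # coordinate-ascent sweeps
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

mf_var = 1.0 / np.diag(Lam)              # variances of the mean-field factors
true_var = np.diag(Sigma)                # true marginal variances

print("mean-field means:    ", m)
print("mean-field variances:", mf_var)
print("true variances:      ", true_var)
```

In this setup the means converge to the true means, but each factor's variance is 1 - 0.9² = 0.19 rather than the true marginal variance of 1, so the approximation is far narrower than the posterior it stands in for.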

Other methods relax the independence assumption or change the objective while maintaining tractability. Expectation propagation approximates the posterior by matching moments, which locally minimizes the divergence in the opposite direction, KL(p || q), rather than KL(q || p). Normalizing flows transform a simple base distribution through a sequence of invertible mappings to produce a complex approximation whose density remains computable through the change-of-variables formula. These methods trade computational cost against approximation fidelity: richer families can fit the posterior more closely but are more expensive to optimize.
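The change-of-variables bookkeeping behind a normalizing flow can be illustrated with the simplest invertible map, a single affine layer. The sketch below is an assumption-laden toy, not a practical flow: the parameters `mu` and `L` are arbitrary, and real flows stack many nonlinear layers. It checks that the flow density agrees with the correlated Gaussian it is known to produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flow parameters: z = L @ eps + mu, with L lower triangular and
# a positive diagonal so the map is invertible.
mu = np.array([0.5, -1.0])
L = np.array([[1.0, 0.0],
              [0.8, 0.6]])


def base_log_density(eps):
    """Log-density of the standard normal base distribution."""
    d = eps.shape[-1]
    return -0.5 * np.sum(eps ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)


def flow_forward(eps):
    """Push base samples through the affine map."""
    return eps @ L.T + mu


def flow_log_density(z):
    """log q(z) = log base(eps) - log|det L|, with eps = L^{-1}(z - mu)."""
    eps = np.linalg.solve(L, (z - mu).T).T
    log_det = np.sum(np.log(np.diag(L)))  # triangular: product of the diagonal
    return base_log_density(eps) - log_det


def gaussian_log_density(z, mean, cov):
    """Reference multivariate normal log-density, computed directly."""
    diff = z - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    d = z.shape[-1]
    return -0.5 * np.sum(diff * sol, axis=-1) - 0.5 * logdet - 0.5 * d * np.log(2 * np.pi)


eps = rng.standard_normal((5, 2))
z = flow_forward(eps)

# The flow induces N(mu, L L^T), a correlated Gaussian outside the mean-field family.
print(flow_log_density(z))
print(gaussian_log_density(z, mu, L @ L.T))
```

Even this one-layer map captures a correlation of 0.8 between the two coordinates, which a fully factorized family cannot represent.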

Connections to Statistical Physics

Variational inference has deep — and often underappreciated — connections to statistical mechanics. The mean-field approximation in variational inference is formally identical to the mean-field approximation in physics, where interactions between particles are replaced by an average 'field' that each particle experiences. The variational free energy in physics is the negative ELBO in machine learning; both represent a trade-off between energy (fit to data) and entropy (complexity of the approximation).
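As one concrete instance of this correspondence, the mean-field treatment of an Ising-type spin system can be written as minimizing a variational free energy over a single magnetization parameter. The sketch below is illustrative only: it assumes zero external field, folds the interaction strength and neighbor count into one hypothetical `coupling` parameter, and uses function names invented for this example.

```python
import numpy as np


def mean_field_magnetization(coupling, beta, m0=0.5, iters=500):
    """Iterate the self-consistency condition m = tanh(beta * coupling * m)."""
    m = m0
    for _ in range(iters):
        m = np.tanh(beta * coupling * m)
    return float(m)


def free_energy_per_spin(m, coupling, beta):
    """Variational free energy = energy - entropy / beta for a factorized q.

    Each spin is +1 with probability (1 + m) / 2 under the factorized q.
    """
    p_up = 0.5 * (1.0 + m)
    p_down = 0.5 * (1.0 - m)
    entropy = -(p_up * np.log(p_up) + p_down * np.log(p_down))
    energy = -0.5 * coupling * m ** 2
    return float(energy - entropy / beta)


beta = 1.0
m_weak = mean_field_magnetization(coupling=0.5, beta=beta)    # beta * coupling < 1
m_strong = mean_field_magnetization(coupling=2.0, beta=beta)  # beta * coupling > 1

print("weak coupling magnetization:  ", m_weak)
print("strong coupling magnetization:", m_strong)
print("free energy at m = 0:         ", free_energy_per_spin(0.0, 2.0, beta))
print("free energy at fixed point:   ", free_energy_per_spin(m_strong, 2.0, beta))
```

Setting the derivative of this free energy with respect to m to zero recovers the tanh self-consistency condition, so the fixed-point iteration plays the same role as a coordinate update on the ELBO. The qualitative change in the solution when beta * coupling crosses 1 is the mean-field version of a phase transition.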

This structural rhyme suggests that the tools of statistical physics — renormalization group methods, phase transition analysis, critical phenomena — may be applicable to understanding the behavior of variational approximations in high-dimensional spaces. A variational approximation that undergoes a 'phase transition' as model complexity increases might suddenly shift from accurate to catastrophically wrong, with no smooth intermediate regime. Understanding when and why this happens is an open problem at the intersection of optimization, statistical physics, and probabilistic modeling.

Variational inference is not a poor man's Bayesianism — it is a principled admission that exact truth is computationally inaccessible and that the shape of our ignorance matters as much as the shape of our knowledge. The KL divergence is not merely a technical convenience; it is a measure of how much structure we are willing to sacrifice for the privilege of having an answer.