Bayesian inference
Core Concept
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Unlike frequentist inference, which treats parameters as fixed unknowns and data as random, Bayesian inference treats parameters as random variables with probability distributions that encode uncertainty.
The essential move: begin with a prior distribution P(H) representing belief about a hypothesis before seeing data. After observing evidence E, compute the posterior distribution P(H|E) via Bayes' theorem:
P(H|E) = P(E|H) * P(H) / P(E)
where P(E|H) is the likelihood of the evidence under the hypothesis, and P(E) is the marginal probability of the evidence, computed by summing (or integrating) P(E|H)·P(H) over all competing hypotheses.
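In the discrete case the theorem is a few lines of arithmetic. A minimal sketch — the two hypotheses, their likelihoods, and the prior are invented for illustration:

```python
# Two competing hypotheses about a coin: fair vs. biased toward heads.
priors = {"fair": 0.5, "biased": 0.5}        # P(H)
likelihoods = {"fair": 0.5, "biased": 0.8}   # P(E = heads | H)

# Marginal probability of the evidence: P(E) = sum over H of P(E|H) P(H)
p_e = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior for each hypothesis: P(H|E) = P(E|H) P(H) / P(E)
posteriors = {h: likelihoods[h] * priors[h] / p_e for h in priors}
print(posteriors)  # observing heads shifts belief toward "biased"
```

One observation of heads moves P(biased) from 0.5 to 8/13 ≈ 0.615 — the likelihoods do all the work once the prior is fixed.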
The Logic of Updating
Bayesian inference is not merely a formula. It is a model of learning — a formal account of how an agent should revise beliefs in light of new information. The prior captures everything known or assumed before the experiment; the likelihood captures how probable the observed data would be under competing hypotheses; the posterior synthesizes them into a new state of belief.
This structure makes Bayesian inference naturally suited to sequential updating: today's posterior becomes tomorrow's prior. In fields where data arrives continuously — clinical trials, online learning, sensor networks — this sequentiality is not a convenience but a necessity.
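The "posterior becomes prior" loop can be sketched directly. This toy example (hypotheses and probabilities are again illustrative) processes coin flips one at a time:

```python
# Sequential updating: after each observation, the posterior becomes
# the prior for the next observation.
p_heads = {"fair": 0.5, "biased": 0.8}   # P(heads | H), illustrative

def update(prior, outcome):
    """One Bayesian update over discrete hypotheses; returns the posterior."""
    like = {h: p_heads[h] if outcome == "H" else 1 - p_heads[h]
            for h in prior}
    p_e = sum(like[h] * prior[h] for h in prior)
    return {h: like[h] * prior[h] / p_e for h in prior}

belief = {"fair": 0.5, "biased": 0.5}    # initial prior
for outcome in "HHTHHHHT":               # data arriving one flip at a time
    belief = update(belief, outcome)     # today's posterior, tomorrow's prior
print(belief)
```

Because each update multiplies in one likelihood term and renormalizes, the final belief is identical to a single batch update on all eight flips — sequential and batch processing agree exactly.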
Conjugate Priors and Computational Tractability
For most of the twentieth century, the practical obstacle to Bayesian methods was computational. Evaluating the denominator P(E) requires integrating over the entire parameter space, a calculation that was analytically intractable for most realistic models and numerically infeasible before modern computing.
The concept of conjugate priors partially solved this: a prior is conjugate to a likelihood if the resulting posterior belongs to the same family of distributions as the prior. For example, a beta prior with a binomial likelihood yields a beta posterior. This analytic tractability made Bayesian inference workable for a limited but important class of models.
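With a conjugate pair, the update collapses to arithmetic on the prior's parameters. A sketch of the beta-binomial case mentioned above (the prior pseudo-counts and data are illustrative):

```python
# Beta(a, b) prior on a coin's heads probability; binomial data.
# Conjugacy: the posterior is Beta(a + heads, b + tails) -- no integration.
a, b = 2.0, 2.0          # illustrative prior pseudo-counts
heads, tails = 7, 3      # observed data

a_post, b_post = a + heads, b + tails
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)  # Beta(9, 5), mean 9/14
```

The prior acts like pseudo-observations: Beta(2, 2) behaves as if two heads and two tails had already been seen, which is why the posterior mean lands between the prior mean (0.5) and the sample frequency (0.7).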
Markov Chain Monte Carlo and the Bayesian Revolution
The real transformation came with Markov Chain Monte Carlo (MCMC) methods, particularly the Metropolis-Hastings algorithm (1953, generalized 1970) and Gibbs sampling (1984). These techniques approximate posterior distributions by constructing Markov chains whose stationary distributions are the posteriors of interest. They do not require analytic integration; they require only that one can evaluate the posterior up to a proportionality constant.
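The "up to a proportionality constant" point is what makes the method usable: a random-walk Metropolis sampler never touches P(E). A minimal sketch, with an illustrative target (N(0, 1) prior, one Gaussian observation, so the exact posterior is N(0.75, 0.5) and we can check the sampler against it):

```python
import math
import random

random.seed(0)

def log_unnorm_posterior(theta):
    """Unnormalized log posterior: N(0, 1) prior times N(theta, 1)
    likelihood for a single observation y = 1.5 (choices illustrative)."""
    y = 1.5
    return -0.5 * theta ** 2 - 0.5 * (y - theta) ** 2

def metropolis(n_samples, step=1.0):
    """Random-walk Metropolis: needs the density only up to a constant."""
    theta = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(current)).
        log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
        if math.log(random.random()) < log_ratio:
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis(20_000)
burned = samples[5_000:]          # discard burn-in before the chain mixes
print(sum(burned) / len(burned))  # close to 0.75, the exact posterior mean
```

Note the two practical rituals already visible in this toy: a burn-in period discarded before averaging, and a tuning parameter (the proposal step size) whose choice affects how quickly the chain explores the posterior.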
MCMC made Bayesian inference practical for complex hierarchical models, mixture models, and latent variable structures that had resisted analytic treatment. The Bayesian revolution of the 1990s — in which Bayesian methods moved from theoretical statistics into applied fields — was driven less by philosophical conversion than by computational feasibility.
Bayesian Inference in Machine Learning
In machine learning, Bayesian inference appears in multiple guises:
- Bayesian parameter estimation: treating neural network weights as distributions rather than point estimates, yielding uncertainty quantification that frequentist methods struggle to provide.
- Gaussian processes: a non-parametric Bayesian approach to regression and classification that naturally captures predictive uncertainty.
- Bayesian optimization: sequential design strategy that uses a probabilistic surrogate model and an acquisition function to optimize expensive black-box functions — widely used in hyperparameter tuning and experimental design.
- Probabilistic programming: systems like Stan, PyMC, and Church that allow users to specify models in high-level languages and automate posterior inference.
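To make the Gaussian-process bullet concrete, here is GP regression worked by hand for two training points, so the kernel-matrix inverse stays an explicit 2×2 formula (the RBF kernel, data points, and jitter value are all illustrative):

```python
import math

# Gaussian-process regression with an RBF kernel, for two training points.
def k(x, y, length=1.0):
    """Squared-exponential (RBF) covariance between inputs x and y."""
    return math.exp(-0.5 * (x - y) ** 2 / length ** 2)

xs, ys = [0.0, 2.0], [1.0, -1.0]   # illustrative training data
noise = 1e-6                       # jitter on the diagonal for stability
x_star = 1.0                       # test input

# Kernel matrix K + noise*I and its explicit 2x2 inverse.
a = k(xs[0], xs[0]) + noise
b = k(xs[0], xs[1])
d = k(xs[1], xs[1]) + noise
det = a * d - b * b
inv = [[d / det, -b / det], [-b / det, a / det]]

# Predictive mean k_*^T K^{-1} y and variance k_** - k_*^T K^{-1} k_*.
k_star = [k(x_star, xs[0]), k(x_star, xs[1])]
alpha = [sum(inv[i][j] * ys[j] for j in range(2)) for i in range(2)]
mean = sum(k_star[i] * alpha[i] for i in range(2))
var = k(x_star, x_star) - sum(
    k_star[i] * sum(inv[i][j] * k_star[j] for j in range(2))
    for i in range(2))
print(mean, var)
```

The test point sits midway between a +1 and a −1 observation, so the predictive mean is 0 by symmetry — but the predictive variance is strictly positive, which is exactly the uncertainty quantification the bullet list describes.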
The connection to algorithmic probability is foundational: both frameworks treat probability as a measure of belief rather than frequency, and both confront the problem of prior specification — how to represent ignorance without introducing unwarranted assumptions.
The Problem of Priors
The most persistent criticism of Bayesian inference concerns prior specification. If the posterior depends on the prior, and the prior is subjective, how can Bayesian inference claim objectivity? This is the problem of priors.
Responses have multiplied over decades:
- Objective Bayesians seek priors that encode maximal ignorance — uniform priors, Jeffreys priors, reference priors — that are determined by the likelihood structure rather than personal belief.
- Subjective Bayesians embrace the prior as a virtue: it makes assumptions explicit and auditable, whereas frequentist methods hide their assumptions in the choice of estimator and confidence procedure.
- Empirical Bayesians estimate priors from the data itself, blurring the line between Bayesian and frequentist approaches.
- Hierarchical Bayesians model the prior as itself uncertain, placing hyperpriors on hyperparameters and letting the data inform the prior structure.
The modern consensus is pragmatic: priors matter less than critics feared when data is abundant, and matter more than enthusiasts admit when data is scarce. The robustness of Bayesian conclusions to prior variation is itself an empirical question.
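That pragmatic claim can be checked in a few lines with the beta-binomial model: two quite different priors, applied to scarce and then abundant data (all the numbers are illustrative):

```python
# Prior sensitivity in a beta-binomial model (numbers illustrative).
# Posterior mean under a Beta(a, b) prior after h heads in n flips:
def posterior_mean(a, b, h, n):
    return (a + h) / (a + b + n)

flat, opinionated = (1, 1), (50, 50)   # Beta(1,1) vs. a strong Beta(50,50)

# Scarce data: 7 heads in 10 flips -- the two priors pull answers apart.
small = [posterior_mean(a, b, 7, 10) for a, b in (flat, opinionated)]

# Abundant data: 7000 heads in 10000 flips -- the priors barely matter.
large = [posterior_mean(a, b, 7000, 10000) for a, b in (flat, opinionated)]
print(small, large)
```

With ten flips the posterior means differ by roughly 0.15; with ten thousand, by under 0.002 — the likelihood swamps the prior exactly as the pragmatic consensus predicts.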
Criticisms and Limits
Bayesian inference is not a universal solvent. Computational complexity remains a barrier: MCMC can be slow to converge for high-dimensional posteriors, and diagnosing convergence is an art as much as a science. Variational inference offers faster approximations but introduces bias that is difficult to quantify.
Model misspecification is another vulnerability: Bayesian updating is optimal only when the likelihood is correctly specified. A misspecified model can produce posteriors that are confidently wrong — more wrong, in some cases, than simple point estimators because the Bayesian machinery amplifies the error through the full distribution.
The Lindley paradox reveals that Bayesian and frequentist methods can disagree radically even with infinite data: a sharp null hypothesis can receive high posterior probability while being rejected by a significance test. This is not a bug in either framework but a symptom of their different ontologies — probability as belief versus probability as long-run frequency.
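The paradox can be reproduced numerically. In this sketch (the model, sample size, and H1 prior are illustrative) the sample mean sits exactly at the two-sided 5% significance boundary, yet the posterior favors the sharp null:

```python
import math

# Lindley's paradox, numerically (all numbers illustrative).
# Model: x_bar ~ N(theta, 1/n). H0: theta = 0. H1: theta ~ N(0, 1).
n = 10_000
x_bar = 1.96 / math.sqrt(n)    # exactly at the two-sided 5% boundary

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)

# Frequentist verdict: two-sided p-value of the z-test.
z = x_bar * math.sqrt(n)
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Bayesian verdict: Bayes factor BF01 = m(x_bar | H0) / m(x_bar | H1).
# Marginally, x_bar ~ N(0, 1/n) under H0 and N(0, 1 + 1/n) under H1.
bf01 = normal_pdf(x_bar, 1 / n) / normal_pdf(x_bar, 1 + 1 / n)
posterior_h0 = bf01 / (1 + bf01)   # assuming equal prior odds
print(p_value, posterior_h0)
```

The significance test rejects H0 at the 5% level while the posterior probability of H0 exceeds 0.9 — and the disagreement only sharpens as n grows, since the Bayes factor in favor of the sharp null scales with the square root of the sample size at a fixed z-score.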
Connections
- Bayes' Theorem — the mathematical foundation
- Pierre-Simon Laplace — the historical pioneer of Bayesian methods
- Algorithmic Probability — Solomonoff's framework for induction
- Artificial intelligence — machine learning applications
- Asymmetric Information — economics of belief updating
- BQP — quantum computing's probabilistic complexity class
- Graphical Models — structured representations for Bayesian reasoning
- Statistical Mechanics — where Bayesian and physical ensembles converge
References and Further Reading
- Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian Theory. Wiley.
- Gelman, A., et al. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
- McGrayne, S. B. (2011). The Theory That Would Not Die. Yale University Press. — historical account of Bayesian inference's contested acceptance.
- Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.