Bayesian model comparison

Bayesian model comparison is the framework for choosing between competing models using the machinery of Bayesian probability theory. Unlike frequentist hypothesis testing, which asks whether a model is 'significantly better' than a null, Bayesian comparison asks a more direct question: given the data and our prior beliefs, how much more probable is one model than another? The answer is the posterior odds, which decomposes into the product of the prior odds and the Bayes factor — the ratio of the marginal likelihoods of the data under each model.

The marginal likelihood (also called the evidence) is the probability of the data given the model, averaged over all possible parameter values weighted by their prior probabilities. It is the integral of the likelihood over the prior:

p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ

This integral automatically penalizes model complexity. A model with more parameters must spread its prior probability over a larger volume of parameter space. Unless the additional parameters substantially improve the fit, the marginal likelihood decreases. This is Bayesian statistics' automatic implementation of Occam's razor: simpler models are preferred unless complexity is justified by the data. The penalty is not ad hoc; it falls directly out of the probability calculus.
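The complexity penalty can be seen in a case where the integral has a closed form. The following is a minimal sketch, not a general recipe: it assumes a sequence of coin flips and compares a model with the bias fixed at 0.5 against a model with a free bias under a uniform prior. The function names and the toy counts are illustrative assumptions.

```python
import math


def log_evidence_fixed(heads, flips, theta=0.5):
    """Log marginal likelihood of a specific flip sequence when theta is fixed.

    With no free parameters there is nothing to integrate over, so the
    evidence is simply the likelihood at the fixed value.
    """
    tails = flips - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def log_evidence_uniform(heads, flips):
    """Log marginal likelihood when theta ~ Uniform(0, 1).

    The integral of theta**heads * (1 - theta)**tails over [0, 1] is the
    Beta function B(heads + 1, tails + 1), evaluated here via log-gamma.
    """
    tails = flips - heads
    return math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(flips + 2)


if __name__ == "__main__":
    flips = 20  # illustrative toy counts, not real data
    for heads in (10, 14, 18):
        fixed = log_evidence_fixed(heads, flips)
        free = log_evidence_uniform(heads, flips)
        print(f"heads={heads:2d}  log p(D|fixed)={fixed:8.3f}  log p(D|free)={free:8.3f}")
```

With 10 heads in 20 flips the fixed-bias model has the higher evidence even though the free-bias model can fit the data at least as well at its best parameter value. With 18 heads the extra parameter pays for itself and the free-bias model wins.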

The Bayes Factor as a Test of Evidence

The Bayes factor B₁₂ = p(D | M₁) / p(D | M₂) quantifies the evidence the data provide for model M₁ relative to model M₂. Unlike a p-value, which is the probability of the observed data (or more extreme data) under a null hypothesis alone, the Bayes factor is a ratio that compares the probability of the observed data under each model, placing the two models on an equal footing. A Bayes factor of 10 means the data are ten times more probable under M₁ than under M₂. On Jeffreys' conventional scale, a Bayes factor above 100 is labelled decisive evidence; a Bayes factor near 1 means the data do not discriminate between the models.
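A Bayes factor is not itself a posterior probability; it becomes one only after it is multiplied by the prior odds. The sketch below assumes that exactly two models are under consideration, and the function names are illustrative.

```python
def posterior_odds(bayes_factor, prior_odds=1.0):
    """Posterior odds for M1 over M2: prior odds times the Bayes factor."""
    return bayes_factor * prior_odds


def posterior_probability(bayes_factor, prior_odds=1.0):
    """Posterior probability of M1, assuming M1 and M2 are the only candidates."""
    odds = posterior_odds(bayes_factor, prior_odds)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Same Bayes factor, different prior odds, different conclusions.
    for prior in (1.0, 0.1):
        print(f"prior odds={prior}: P(M1 | D) = {posterior_probability(10.0, prior):.3f}")
```

With equal prior odds, a Bayes factor of 10 gives M₁ a posterior probability of 10/11. If M₁ starts at prior odds of 1 to 10, the same Bayes factor only brings the two models level.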

The Bayes factor has a formal connection to the Kullback-Leibler divergence: in the limit of large sample sizes, the log Bayes factor grows approximately in proportion to the sample size times the difference in KL divergences from the true data-generating process to each model. AIC targets the same KL divergence, so the two criteria usually agree when the sample is large and one model is clearly closer to the truth. The closer asymptotic relative of the Bayes factor, however, is BIC, which arises from a Laplace approximation to the marginal likelihood and charges each parameter (log n)/2 on the log-likelihood scale, where AIC charges a fixed 1. The Bayes factor and AIC can therefore disagree in small samples, when prior information matters, when the models fit almost equally well, and when neither model is particularly close to the truth. The Bayes factor is sensitive to prior specification; AIC has no prior to misspecify.
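How the log Bayes factor relates to penalized-likelihood criteria can be checked in a toy case with a closed form. The sketch below assumes n observations from a normal distribution with unit variance, a null model with mean fixed at 0, and an alternative whose mean has a Normal(0, tau²) prior. It holds the standardized sample mean z = sqrt(n) * xbar fixed while n grows. The setup and function names are assumptions chosen for illustration; positive values favour the free-mean model.

```python
import math


def log_bf_exact(z, n, tau=1.0):
    """Exact log Bayes factor (free mean vs. mean = 0) for unit-variance normal data.

    Under the null, xbar ~ Normal(0, 1/n); under the alternative,
    xbar ~ Normal(0, tau**2 + 1/n). The ratio of these densities at the
    observed xbar = z / sqrt(n) gives the expression below.
    """
    r = n * tau**2
    return -0.5 * math.log(1.0 + r) + 0.5 * z**2 * r / (1.0 + r)


def log_bf_bic(z, n):
    """BIC approximation: gain in maximized log-likelihood minus (1/2) log n."""
    return 0.5 * z**2 - 0.5 * math.log(n)


def aic_difference(z, n):
    """Half the AIC difference: the same likelihood gain minus a fixed penalty of 1.

    n is accepted only so the three functions share a signature.
    """
    return 0.5 * z**2 - 1.0


if __name__ == "__main__":
    z = 2.0  # a borderline "significant" standardized mean, held fixed
    for n in (10, 100, 10_000):
        print(f"n={n:6d}  exact={log_bf_exact(z, n):7.3f}  "
              f"BIC={log_bf_bic(z, n):7.3f}  AIC={aic_difference(z, n):7.3f}")
```

In this setting the BIC approximation tracks the exact log Bayes factor closely once n is large, while the AIC difference does not depend on n at all. For fixed z, the Bayes factor eventually favours the null model and AIC continues to favour the free mean.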

Comparison with Information-Theoretic Criteria

The relationship between Bayesian model comparison and information-theoretic criteria like AIC and MDL is closer than the philosophical divide between Bayesian and frequentist camps suggests. All three frameworks score a model by how well it accounts for the data while charging for complexity. The Bayes factor uses the marginal likelihood, which is the probability the model assigned to the data before seeing them; AIC uses the maximized likelihood penalized by parameter count, as an estimate of expected out-of-sample predictive accuracy; MDL uses the minimum number of bits required to encode the data using the model. In the asymptotic regime, these quantities share the same leading term, the maximized log-likelihood, and differ only in the size of the complexity penalty. They are different roads to nearly the same destination.

But the differences matter. The Bayes factor is coherent: it obeys the probability axioms, and evidence from independent studies can be combined by sequential updating. AIC is not coherent in this sense: the AICs of two models for one dataset do not constrain the AICs for another. MDL is coherent but requires a coding scheme, which introduces a different kind of arbitrariness. The choice between them is not merely computational. It is epistemological: do you trust your priors enough to integrate over them, or do you prefer a prior-free estimate of predictive accuracy? Do you want coherence or robustness? This is a genuine tradeoff, not a question with a single correct answer.

Model Comparison in Complex Systems

In complex systems — agent-based models, neural networks, biological networks — the assumptions underlying both Bayesian and information-theoretic comparison often fail. The number of parameters may be undefined (what counts as a parameter in a neural network with dropout?). The likelihood may be intractable. The models may be non-nested, non-parametric, or purely algorithmic. In these settings, the Bayes factor requires approximations (Laplace approximation, variational inference, sampling) whose accuracy is itself uncertain.
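The accuracy of such approximations can be checked directly in the rare cases where the exact answer is known. The sketch below applies a Laplace approximation to a coin-flip model with a uniform prior on the bias, for which the exact evidence is a Beta function. The model choice and function names are assumptions for illustration, and the approximation as written requires at least one head and one tail.

```python
import math


def log_evidence_exact(heads, flips):
    """Exact log evidence under theta ~ Uniform(0, 1): log B(heads + 1, tails + 1)."""
    tails = flips - heads
    return math.lgamma(heads + 1) + math.lgamma(tails + 1) - math.lgamma(flips + 2)


def log_evidence_laplace(heads, flips):
    """Laplace approximation: fit a Gaussian to the integrand at its mode.

    log p(D) ~= log p(D | mode) + log prior(mode)
                + 0.5 * log(2 * pi) - 0.5 * log(curvature),
    where curvature is minus the second derivative of the log integrand at
    the mode. The uniform prior contributes log(1) = 0.
    """
    tails = flips - heads
    if heads == 0 or tails == 0:
        raise ValueError("mode lies on the boundary; Laplace approximation not applicable")
    mode = heads / flips
    log_lik = heads * math.log(mode) + tails * math.log(1.0 - mode)
    curvature = flips / (mode * (1.0 - mode))
    return log_lik + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(curvature)


if __name__ == "__main__":
    for heads, flips in ((2, 4), (10, 20), (100, 200)):
        exact = log_evidence_exact(heads, flips)
        approx = log_evidence_laplace(heads, flips)
        print(f"n={flips:3d}  exact={exact:9.4f}  laplace={approx:9.4f}  "
              f"error={approx - exact:+.4f}")
```

In this one-parameter case the error shrinks as the sample grows. In models where no exact answer is available for comparison, the size of the error has to be argued for rather than measured.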

The posterior predictive check, a Bayesian technique that compares data simulated from the fitted model to the observed data, offers a more flexible alternative. It does not ask which model is more probable; it asks whether the model generates data that look like the real thing. This is a pragmatic shift from comparative evaluation to absolute evaluation, and it is sometimes more useful in complex systems where model comparison is computationally impossible but model criticism is still feasible.
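A posterior predictive check can be sketched in a few lines for a simple case. The example below assumes a binary sequence modelled as independent flips with a uniform prior on the bias, and uses the number of switches between 0 and 1 as the test statistic, since an independence model cannot reproduce streaky data. The sequence is made-up toy data and the function names are illustrative.

```python
import random


def count_switches(seq):
    """Number of adjacent positions where the value changes."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def posterior_predictive_pvalue(observed, n_draws=5000, seed=0):
    """Fraction of replicated datasets with at most as many switches as observed.

    Each replicate draws theta from the Beta posterior implied by a uniform
    prior, then simulates a new sequence of the same length as the data.
    """
    rng = random.Random(seed)
    heads = sum(observed)
    tails = len(observed) - heads
    t_obs = count_switches(observed)
    at_most = 0
    for _ in range(n_draws):
        theta = rng.betavariate(heads + 1, tails + 1)
        replicate = [1 if rng.random() < theta else 0 for _ in observed]
        if count_switches(replicate) <= t_obs:
            at_most += 1
    return at_most / n_draws


if __name__ == "__main__":
    # Toy data with long runs: 7 ones, 13 zeros, only 3 switches.
    data = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print("observed switches:", count_switches(data))
    print("posterior predictive p-value:", posterior_predictive_pvalue(data))
```

A small value indicates that the fitted model rarely produces sequences as streaky as the observed one. That is evidence against the independence assumption, obtained without naming any competing model.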

Bayesian model comparison is often presented as the rational alternative to the frequentist p-value, a cleaner and more principled way to do science. But the principledness is purchased at a price: the Bayes factor is exquisitely sensitive to prior specification, and in complex systems where priors are themselves objects of uncertainty, this sensitivity is not a bug but a structural feature. The Bayesian who treats the prior as 'just a belief' and the frequentist who treats the p-value as 'just a convention' are equally evading the hard problem: all model comparison is underdetermined by data, and the formalism you choose determines which aspects of that underdetermination you will see. Bayes factors make prior sensitivity visible. That is their honesty, not their weakness.
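The prior sensitivity described above can be made concrete with a toy calculation. The sketch assumes unit-variance normal data, a null model with mean 0, and an alternative with a Normal(0, tau²) prior on the mean. It holds the data fixed and varies only the prior width tau. The setup, the numbers, and the function name are illustrative assumptions; positive values favour the free-mean model.

```python
import math


def log_bf_free_vs_null(z, n, tau):
    """Log Bayes factor for a free mean with a Normal(0, tau**2) prior vs. mean = 0.

    z is the standardized sample mean, sqrt(n) * xbar, for n unit-variance
    normal observations.
    """
    r = n * tau**2
    return -0.5 * math.log(1.0 + r) + 0.5 * z**2 * r / (1.0 + r)


if __name__ == "__main__":
    z, n = 2.5, 50  # the same data summary throughout
    for tau in (0.1, 1.0, 10.0, 100.0):
        print(f"tau={tau:6.1f}  log BF = {log_bf_free_vs_null(z, n, tau):+.3f}")
```

With the data held fixed, widening the prior from tau = 1 to tau = 100 changes the sign of the log Bayes factor: the same observations favour the free-mean model under one prior and the null under the other.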