PAC-Bayes

PAC-Bayes is a framework in statistical learning theory that combines PAC learning with Bayesian inference to provide non-vacuous generalization bounds for complex models. Developed by David McAllester in the late 1990s, PAC-Bayes addresses a critical weakness of classical PAC theory: the bounds it provides are typically too loose to be useful for real-world classifiers like neural networks.

The central insight is to treat the learning algorithm as producing a distribution over hypotheses rather than a single hypothesis. Instead of bounding the error of one classifier, PAC-Bayes bounds the expected error of a classifier drawn from a 'posterior' distribution, relative to a 'prior' distribution chosen before seeing the data. The bound takes the form of a trade-off between the empirical risk on the training set and the Kullback-Leibler divergence between the posterior and the prior.

This framework has seen a resurgence because it can provide non-vacuous generalization bounds for deep neural networks — something classical uniform convergence approaches cannot do. The connection to information theory is direct: the KL divergence term measures how much information the data has injected into the posterior relative to the prior. When the posterior concentrates tightly around a single hypothesis, the KL term grows, and the bound tightens. When the posterior remains diffuse, the bound remains loose.

PAC-Bayes is not merely a technical refinement. It reframes the generalization problem as a problem about information compression: a model generalizes well if the training data can be compressed into a compact posterior distribution. This compression perspective connects learning theory to minimum description length principles and to the broader question of why simpler models tend to generalize better than complex ones.

PAC-Bayes as a Systems Principle

The PAC-Bayes bound is not just a theorem about classifiers. It is a theorem about how systems learn from finite information. The bound applies whenever a system must generalize from a sample to a population — and this condition is universal.

In biological evolution, the "prior" is the distribution of phenotypes produced by mutation and recombination; the "posterior" is the distribution after selection. The KL divergence between them measures how much information the environment has transmitted to the population. A population that adapts rapidly (tight posterior around a fit phenotype) has high KL divergence from its prior — it has learned a lot from selection. But the PAC-Bayes bound warns: rapid adaptation to one environment may mean poor generalization to another. The bound is the formalization of the evolutionary trade-off between specialization and evolvability.

In scientific inference, the prior is the set of hypotheses a community considers plausible before an experiment; the posterior is the set after the experiment. The KL divergence measures how much the experiment has shifted scientific belief. A revolutionary experiment (high KL divergence) changes the posterior dramatically; a null result (low KL divergence) leaves it unchanged. The PAC-Bayes bound predicts that scientific theories that explain a narrow range of phenomena very well (low empirical risk on a specific dataset) but require dramatic prior revision (high KL divergence) will fail to generalize to new phenomena.

In economic forecasting, the prior is the market's baseline model of the economy; the posterior is the model after observing new data. The KL divergence measures how much the data has changed market expectations. The bound suggests that forecasters who revise their models dramatically in response to each new data point (high KL divergence) will overfit and perform poorly out-of-sample. This is the formal justification for regularization in forecasting — the deliberate under-reaction to new information to preserve generalization.

The Compression-Emergence Connection

PAC-Bayes reveals a deep connection between compression and emergence. A system that compresses its observations into a compact representation (low KL divergence from a simple prior) is a system that has discovered structure in its environment. The compression is not merely efficient storage; it is the extraction of regularities that support prediction.

This connects to emergence: emergent properties are precisely those that can be compactly described at a macro-level but not at the micro-level. The emergence of a new level of organization — the appearance of a giant component in a random graph, the crystallization of a liquid, the formation of a concept in a neural network — is accompanied by a compression event: the macro-description becomes more efficient than the micro-description. PAC-Bayes quantifies this: the bound tightens when the posterior discovers a compressed representation of the data.

PAC-Bayes is often presented as a technical tool for proving generalization bounds, and it is that. But it is also something larger: a formal statement of the relationship between what a system has seen, what it believes, and what it can predict. In a world where AI systems are trained on ever-larger datasets and deployed in ever-more-critical contexts, the PAC-Bayes bound is not merely a mathematical curiosity. It is a warning. The bound says: you cannot compress more than the information in the data, and if your posterior is too complex relative to your sample, you are fooling yourself. The history of overfitting — from polynomial regression in the 1970s to deep learning in the 2020s — is a history of ignoring this warning.