PAC Learning

Probably Approximately Correct (PAC) learning is the foundational framework of statistical learning theory — the mathematical discipline that asks what it means for a machine to 'learn' from data, and under what conditions such learning is guaranteed. Introduced by Leslie Valiant in 1984, PAC learning replaces the vague aspiration of 'generalization' with a precisely stated contract: a learning algorithm is PAC if, given enough data, it can produce a hypothesis that is approximately correct with high probability. The name itself is a philosophical stance: the framework does not promise certainty, only probable approximate correctness. This is not modesty but realism. In a world of finite data and noisy observations, 'probably approximately correct' is the most honesty that any learning theory can offer.

The formal definition is elegant in its restraint. Given a hypothesis class H, a distribution D over inputs, and parameters ε (accuracy) and δ (confidence), a learning algorithm is PAC if for any target concept in H, it outputs a hypothesis h such that with probability at least 1−δ, the true error of h is at most ε. The number of samples required to achieve this guarantee is the 'sample complexity' of the class. The key question is not whether learning is possible but how much data it costs — and that cost is determined by the complexity of H.

The Complexity Tax: What Learning Costs

The central insight of PAC theory is that learning has a price, and the price is measured in complexity. The two dominant currencies of this price are the VC dimension and Rademacher complexity. The VC dimension measures the worst-case expressive capacity of a hypothesis class: how many points can it shatter? The Rademacher complexity measures a more refined quantity: the expected ability of the class to fit random noise, which depends on the actual data distribution rather than a worst-case abstraction.

Both complexity measures enter into generalization bounds through the machinery of uniform convergence. The classical result is the Vapnik-Chervonenkis bound: with probability 1−δ, the difference between empirical risk and true risk is bounded by a function of √(d/n), where d is the VC dimension and n is the sample size. This bound is distribution-free — it holds for any data-generating distribution — which is both its strength and its weakness. The generality comes at the cost of looseness: the bound is typically far more conservative than empirical reality.

Rademacher complexity offers a tighter, data-dependent alternative. Because it measures the class's ability to fit random noise on the actual data, it captures structure that the worst-case VC analysis ignores. The tradeoff is computational: Rademacher complexity is harder to estimate than VC dimension, and the tighter bounds require more work to compute. This mirrors a broader pattern in learning theory: the more you know about your problem, the better your guarantees can be, but the more you must pay in analysis.

PAC Learning as a Theory of Bounded Rationality

The systems-theoretic reading of PAC learning is that it is a formal theory of bounded rationality. The framework does not ask: 'What is the true hypothesis?' It asks: 'What can be reliably inferred given limited data and limited computational resources?' The ε and δ parameters are not arbitrary tolerances. They are the formal expression of a fundamental constraint: any agent learning from finite experience must accept some probability of error and some distance from perfection.

This reframing connects PAC learning to the broader study of how complex systems make decisions under uncertainty. In information theory, the channel capacity theorem states that reliable communication is possible up to a rate limit, but never beyond it. In PAC learning, the sample complexity is the analogous rate limit: reliable learning is possible up to a complexity limit, but never beyond it. The two limits are structurally identical. Both arise from the fact that finite resources — bandwidth for communication, data for learning — impose hard constraints on what can be achieved. The mathematician who sees only equations misses the architecture: PAC learning is the information theory of induction.

The connection runs deeper. The mutual information between a hypothesis and the data that selected it measures how much the data has reduced our uncertainty about the target. A PAC guarantee is essentially a statement that this mutual information is bounded below by a function of the sample size and the complexity of the hypothesis class. Learning, in this view, is not the acquisition of knowledge but the reduction of uncertainty through structured interaction with an environment. The hypothesis is not a representation of truth; it is a compressed encoding of the data's regularities, subject to the constraints of the learning architecture.

The Deep Learning Crisis and What Survives

The classical PAC framework assumes that the hypothesis class is fixed and that the learning algorithm is ERM over that class. Modern deep learning violates both assumptions. The hypothesis class is not fixed — it is defined implicitly by the architecture and the training dynamics. The learning algorithm is not classical ERM — it is stochastic gradient descent with implicit regularization, early stopping, and data augmentation. The result is that neural networks operate in regimes where classical PAC bounds predict catastrophic overfitting, yet they generalize well.

This is not a refutation of PAC learning. It is a revelation of its domain. The classical framework treats the hypothesis class as an unstructured set and the learning algorithm as an arbitrary optimizer. Deep learning reveals that both assumptions are wrong: the effective hypothesis class is not the full parameter space but the manifold traced by the optimizer, and the optimizer itself is not neutral but biased toward simple solutions. The bounds that matter are not on the full class but on the 'effective complexity' of the functions actually explored during training.

What survives from PAC theory is the structural truth: 'generalization requires that the complexity of what is learned remains bounded relative to the information in the data.' The specific measures of complexity have shifted from VC dimension to neural tangent kernel norms, from Rademacher complexity to PAC-Bayesian bounds, and from worst-case analysis to data-dependent geometry. But the fundamental contract remains. Any learning system that generalizes must, in some sense, be doing what PAC theory describes — even if the classical vocabulary no longer fits.

PAC learning is not a certificate of safety. It is a language for asking the right questions. The question is not 'Does this model generalize?' but 'What is the effective complexity of the functions this training procedure actually explores, and is that complexity bounded by the information in the data?' Answering that question requires tools that do not yet exist. But the framework for asking it — the insistence on probable, approximate, bounded correctness — is the most durable contribution of the theory.