Overfitting
Overfitting occurs when a machine learning model learns the training data too well — capturing noise and idiosyncratic features that do not generalize to new inputs. The model performs excellently on examples it has seen and poorly on examples it has not. It has memorized rather than learned.
The technical definition: a model overfits when its training error is substantially lower than its generalization error (error on held-out data). The gap between these two quantities is the measure of overfitting. Classical statistical theory predicted that sufficiently complex models would always overfit given insufficient data. Modern practice has complicated this picture: very large neural networks, trained with gradient descent, often exhibit double descent: generalization error falls, then rises as model capacity approaches the interpolation threshold (the point where training error reaches zero), then falls again as capacity grows past it. The largest models sometimes generalize better than medium-sized models that classical theory predicted should perform optimally. The theoretical explanation for this remains incomplete.
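The definition suggests a direct measurement: fit a model flexible enough to interpolate its training set, then compare training error with error on held-out data. A minimal sketch in Python; the sine target, noise level, sample sizes, and degree-9 polynomial are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

# Toy regression task: a sine target with observation noise.
f = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0.0, 1.0, 10)
y_train = f(x_train) + rng.normal(0.0, 0.2, 10)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = f(x_test) + rng.normal(0.0, 0.2, 200)

# A degree-9 polynomial through 10 points can interpolate the training set,
# noise included: training error collapses while held-out error does not.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = mse(y_train, np.polyval(coeffs, x_train))
test_err = mse(y_test, np.polyval(coeffs, x_test))

print(f"train MSE: {train_err:.6f}")
print(f"test  MSE: {test_err:.6f}")
```

The same ten points fit with a low-degree polynomial would show a much smaller gap; it is the capacity to absorb the noise that opens it.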
The practical responses to overfitting — regularization (penalizing parameter magnitude), dropout (randomly zeroing activations during training), early stopping (halting optimization before training error reaches zero), data augmentation (artificially expanding the training set) — are engineering interventions developed empirically before they were understood theoretically. Each works in practice. Each has failure modes that practitioners learn by experience rather than from first principles. An aligned system cannot afford to be an overfitted one: overfitting to training objectives is precisely the mechanism by which systems that optimize proxy measures diverge from human intentions.
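The first of these interventions, penalizing parameter magnitude, at least has a closed form in the linear case (ridge regression). A sketch on assumed toy data; the penalty strength 1e-3, the sine target, and the sample size are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_features(x, degree):
    # Columns 1, x, x^2, ..., x^degree.
    return np.vstack([x ** d for d in range(degree + 1)]).T

f = lambda x: np.sin(2 * np.pi * x)
x_tr = np.linspace(0.0, 1.0, 15)
y_tr = f(x_tr) + rng.normal(0.0, 0.2, 15)

X = poly_features(x_tr, 9)

# Unregularized least squares: coefficients are free to grow to fit the noise.
w_ols, *_ = np.linalg.lstsq(X, y_tr, rcond=None)

# Ridge: w = (X^T X + lam * I)^{-1} X^T y penalizes squared coefficient norm.
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_tr)

print(f"||w|| unregularized: {np.linalg.norm(w_ols):.2f}")
print(f"||w|| ridge:         {np.linalg.norm(w_ridge):.2f}")
```

Dropout, early stopping, and augmentation have no comparable closed form; they are procedural interventions on the training loop itself, which is part of why their failure modes are learned by experience.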
The Distribution Problem: Overfitting Beyond the Training Set
Overfitting as classically defined is a problem that test sets are designed to detect: hold out some data, measure performance on it, observe the gap. This detection procedure rests on an assumption so deeply embedded it is rarely stated: that the test set and the training set are drawn from the same distribution.
In laboratory settings, this assumption holds by construction. In deployment, it rarely holds. A model trained to detect spam in 2020 and evaluated on a test set from 2020 can appear to generalize well. The same model running against 2024 email has encountered distribution shift: the marginal distribution of spam features, the vocabulary, the formatting conventions have all changed. The test-set performance number — the number that appears in publications, procurement documents, and regulatory filings — is not a prediction of deployment performance. It is a measurement of performance in a world that no longer exists.
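The spam example can be simulated: train a threshold classifier where a "spamminess" feature is informative, then evaluate it after the feature distribution has drifted. The one-dimensional feature and the specific class means are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_emails(n, spam_mean, rng):
    # One scalar "spamminess" feature per email; label 1 = spam, 0 = ham.
    y = rng.integers(0, 2, n)
    spam = rng.normal(spam_mean, 1.0, n)
    ham = rng.normal(0.0, 1.0, n)
    return np.where(y == 1, spam, ham), y

# Training data from the "2020" distribution; midpoint threshold classifier.
x_tr, y_tr = make_emails(2000, 3.0, rng)
threshold = (x_tr[y_tr == 1].mean() + x_tr[y_tr == 0].mean()) / 2.0

def accuracy(x, y):
    return float(np.mean((x > threshold) == y))

x_2020, y_2020 = make_emails(2000, 3.0, rng)  # same distribution: looks fine
x_2024, y_2024 = make_emails(2000, 0.8, rng)  # spammers adapted: inputs drifted

acc_2020 = accuracy(x_2020, y_2020)
acc_2024 = accuracy(x_2024, y_2024)
print(f"2020 accuracy: {acc_2020:.3f}")
print(f"2024 accuracy: {acc_2024:.3f}")
```

The 2020 number is the one that would have been published; the 2024 number is what the model actually delivers.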
The standard response — 'retrain periodically with fresh data' — assumes that the degradation is detectable before it causes harm. This assumption fails in any application where ground truth labels arrive slowly: medical diagnosis, loan default prediction, content moderation, autonomous navigation. The model drifts. The drift is invisible. The consequences accumulate.
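When ground truth arrives slowly, the only signals available in time are the inputs themselves. One common partial mitigation, monitoring input statistics against a baseline recorded at training time, needs no labels at all. A minimal sketch using a z-statistic on the batch mean; real monitors use richer tests (population-stability indices, Kolmogorov-Smirnov statistics), and this catches covariate shift but not concept drift:

```python
import numpy as np

rng = np.random.default_rng(3)

# Input statistics captured at training time. No labels are involved.
baseline = rng.normal(0.0, 1.0, 5000)
mu, sigma = float(baseline.mean()), float(baseline.std())

def drift_score(batch):
    # Standardized shift of the batch mean relative to the training baseline.
    return abs(float(batch.mean()) - mu) / (sigma / np.sqrt(len(batch)))

in_dist = rng.normal(0.0, 1.0, 500)  # deployment traffic, no shift
drifted = rng.normal(0.4, 1.0, 500)  # inputs have moved; labels still unknown

print(f"score, no shift: {drift_score(in_dist):.2f}")
print(f"score, drifted:  {drift_score(drifted):.2f}")
```

A monitor like this turns some invisible drift into visible drift, but a model whose inputs look unchanged while their meaning shifts will still degrade silently.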
The practical implication is unsettling: every published generalization error estimate in machine learning literature is, in expectation over the model's deployment lifetime, a lower bound on its actual deployment error. The gap between the reported number and the actual deployment error is unknown at publication time, unknown at deployment time, and typically becomes known only after failure. This is not a scandal — it is a structural feature of the problem. But it is being systematically misrepresented as a known quantity.