Overparameterization

Overparameterization is the condition in which a statistical model has more learnable parameters than training examples. In classical statistics, this condition was treated as a pathology — the regime of overfitting, where a model memorizes noise and generalizes poorly. The advent of deep learning has overturned this view. Modern neural networks routinely operate with billions of parameters trained on millions of examples, a regime of extreme overparameterization, and they generalize remarkably well. The classical prohibition against overparameterization was not wrong in its own domain; it was domain-bound, and the domain has changed.

The overparameterized regime is characterized by two properties that distinguish it from the classical underparameterized regime. First, there exist infinitely many interpolating solutions — parameter configurations that achieve zero training error — so the optimization problem is underdetermined. Second, generalization is controlled not by model capacity but by the implicit regularization of the optimization algorithm: which interpolating solution the optimizer finds depends on the dynamics of training, not on the hypothesis space alone. This is the setting in which double descent occurs, and it is the setting in which most contemporary machine learning operates.

The phenomenon of benign overfitting — where a model overfits the training data yet still generalizes well — is possible only in the overparameterized regime. It requires that the data have low intrinsic dimensionality relative to the ambient parameter space, so that the interpolating solution found by the optimizer lies in a favorable region of the solution manifold. Not all overparameterization is benign: a badly conditioned optimizer, an adversarially structured dataset, or a misspecified architecture can produce harmful overfitting even with infinite capacity. The difference between benign and harmful overfitting is structural, not merely a matter of degree.

Overparameterization is not a bug that deep learning has learned to tolerate. It is a feature that enables the second descent of generalization, and the field's failure to understand why is the most important open problem in learning theory. The classical statisticians who warned against too many parameters were right about their regime. The deep learning practitioners who ignore those warnings are right about theirs. What no one has yet explained is what separates the two regimes, and whether the boundary is sharp or gradual — a question that strikes at the heart of whether machine learning has a unified theory or merely a collection of special cases.