Minimum norm solution

From Emergent Wiki

The minimum norm solution of an underdetermined linear system is the unique exact solution with the smallest Euclidean norm. An underdetermined system has more unknowns than equations and, when it is consistent, infinitely many exact solutions. In the overparameterized regime of modern machine learning, where models routinely have more parameters than training examples, the minimum norm solution is what gradient descent converges to on a linear least-squares problem when initialized at zero. It is not merely a mathematical curiosity but a structural determinant of whether interpolation leads to generalization or catastrophe.

Mathematical Definition

Consider the linear system \( \mathbf{X} \mathbf{w} = \mathbf{y} \), where \( \mathbf{X} \in \mathbb{R}^{n \times d} \) with \( n < d \) and \( \mathbf{y} \in \mathbb{R}^{n} \). This system is underdetermined: when it is consistent, infinitely many weight vectors \( \mathbf{w} \) satisfy the equation exactly. Among all such interpolating solutions, the minimum norm solution is the one that minimizes \( \| \mathbf{w} \|_2 \). When \( \mathbf{X} \) has full row rank, so that \( \mathbf{X} \mathbf{X}^\top \) is invertible, it has the closed-form expression

\( \mathbf{w}_{\text{min-norm}} = \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top)^{-1} \mathbf{y} \)

which is the Moore-Penrose pseudoinverse of the data matrix applied to the targets, \( \mathbf{w}_{\text{min-norm}} = \mathbf{X}^{+} \mathbf{y} \). This solution can also be obtained as the limit of Tikhonov regularization (ridge regression) as the regularization parameter \( \lambda \rightarrow 0^+ \). The penalty is what breaks the tie: the data alone admit infinitely many exact solutions, and the squared-norm penalty, however small, selects the one closest to the origin. The minimum norm solution is therefore singled out by the geometry of the penalty, not by the data alone.
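The three characterizations above (closed form, pseudoinverse, ridge limit) can be compared numerically. The following is a minimal sketch using NumPy; the random Gaussian data, the seed, the sizes `n` and `d`, and the value of `lam` are illustrative assumptions, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                       # fewer equations than unknowns
X = rng.standard_normal((n, d))    # Gaussian X has full row rank almost surely
y = rng.standard_normal(n)

# Closed form X^T (X X^T)^{-1} y, using a linear solve instead of an explicit inverse
w_closed = X.T @ np.linalg.solve(X @ X.T, y)

# Moore-Penrose pseudoinverse applied to the targets
w_pinv = np.linalg.pinv(X) @ y

# Ridge regression with a small penalty: (X^T X + lam I)^{-1} X^T y
lam = 1e-6                         # illustrative value standing in for lam -> 0+
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Another exact solution: add a vector from the null space of X
z = rng.standard_normal(d)
z_null = z - np.linalg.pinv(X) @ (X @ z)   # remove the row-space component of z
w_other = w_closed + z_null

print("closed form vs pinv :", np.max(np.abs(w_closed - w_pinv)))
print("closed form vs ridge:", np.max(np.abs(w_closed - w_ridge)))
print("norm of min-norm solution  :", np.linalg.norm(w_closed))
print("norm of another interpolant:", np.linalg.norm(w_other))
```

Because `z_null` is orthogonal to the row space of `X`, `w_other` still satisfies the system exactly, but its norm is strictly larger than that of `w_closed`.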

Role in Overparameterized Learning

The minimum norm solution is the bridge between interpolation and benign overfitting. In the overparameterized regime, a model can achieve zero training error (it interpolates) and still generalize well. The minimum norm property helps explain why: among all interpolating solutions, gradient descent from zero initialization on a linear least-squares problem finds the one with smallest parameter norm. In high-dimensional feature spaces with favorable spectral structure, this low-norm interpolant can track the underlying signal while absorbing label noise in many low-variance directions that have little effect on predictions.
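The claim that gradient descent selects the minimum norm interpolant can be checked directly in the linear least-squares case. The sketch below is an illustration under stated assumptions: squared loss, a fixed step size, random Gaussian data, and a hypothetical helper named `gradient_descent`. It also shows that the selection depends on the zero initialization.

```python
import numpy as np

def gradient_descent(X, y, w0, steps=2000):
    """Plain gradient descent on 0.5 * ||X w - y||^2 with a fixed step size."""
    # 1 / sigma_max(X)^2 is a safe step size for this quadratic loss
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)    # gradient of the squared loss
    return w

rng = np.random.default_rng(1)
n, d = 10, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_min = np.linalg.pinv(X) @ y                          # minimum norm solution
w_from_zero = gradient_descent(X, y, np.zeros(d))      # zero initialization
w_from_rand = gradient_descent(X, y, rng.standard_normal(d))

print("distance to min-norm, zero init  :", np.linalg.norm(w_from_zero - w_min))
print("distance to min-norm, random init:", np.linalg.norm(w_from_rand - w_min))
print("training residual, random init   :", np.linalg.norm(X @ w_from_rand - y))
```

Both runs interpolate the data, but only the run started at zero lands on the minimum norm solution. The random start keeps its initial null-space component, which gradient descent never touches.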

The connection to double descent is direct. The second descent in the error curve, the drop in test error beyond the interpolation threshold, is commonly attributed to the optimization algorithm implicitly solving a minimum-norm problem. As model width increases, the set of interpolating solutions grows, and for nested feature sets the minimum norm among them cannot increase, because every interpolant of the narrower model remains available with the new weights set to zero. If the data lies near a low-dimensional manifold, the minimum norm interpolant stays close to the true function, and generalization improves. If the data lacks this structure, the minimum norm solution may still overfit badly: benign overfitting is not guaranteed, merely made possible.
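The behavior of the minimum norm as width grows can be illustrated for nested feature sets, where each wider model contains the narrower one. The sketch below assumes hypothetical random Gaussian features and arbitrary widths. It tracks only the parameter norm and makes no claim about test error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_max = 20, 400
X_full = rng.standard_normal((n, d_max))   # hypothetical random features
y = rng.standard_normal(n)

widths = [25, 50, 100, 200, 400]           # all beyond the interpolation threshold d = n
norms = []
for d in widths:
    X = X_full[:, :d]                      # nested: wider models reuse earlier columns
    w = np.linalg.pinv(X) @ y              # minimum norm interpolant at this width
    assert np.allclose(X @ w, y)           # the training data are still fit exactly
    norms.append(np.linalg.norm(w))

for d, nrm in zip(widths, norms):
    print(f"width {d:4d}: minimum norm = {nrm:.4f}")
```

The norm is non-increasing in width because any interpolant of a narrower model, padded with zeros, is also an interpolant of the wider one.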

Systems-Theoretic Interpretation

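From a systems perspective, the minimum norm solution is the equilibrium reached by a dissipative dynamical system. Gradient descent with a small step size approximates gradient flow, which monotonically decreases the loss, and every interpolating solution is an equilibrium of that flow. On a linear least-squares problem started from zero, the trajectory never leaves the row space of \( \mathbf{X} \), and the only interpolating solution in that subspace is the minimum norm one. The norm constraint is therefore not an external imposition but an emergent property of the dynamics together with the initialization: the trajectory settles at the point of least energy among all solutions that satisfy the data constraints.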
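The convergence claim can be made explicit in the linear least-squares case. The following is a sketch, assuming the squared loss \( \tfrac{1}{2} \| \mathbf{X} \mathbf{w} - \mathbf{y} \|_2^2 \), a step size \( \eta \) small enough for convergence, and full row rank \( \mathbf{X} \); the symbols \( \eta \) and \( \boldsymbol{\alpha}_t \) are introduced here for illustration. The gradient descent update is

\( \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \mathbf{X}^\top (\mathbf{X} \mathbf{w}_t - \mathbf{y}), \quad \mathbf{w}_0 = \mathbf{0} \)

Each step adds \( \mathbf{X}^\top \) times a vector in \( \mathbb{R}^n \), so by induction every iterate has the form \( \mathbf{w}_t = \mathbf{X}^\top \boldsymbol{\alpha}_t \) for some \( \boldsymbol{\alpha}_t \in \mathbb{R}^n \). If the iterates converge to an interpolating limit \( \mathbf{w}_\infty = \mathbf{X}^\top \boldsymbol{\alpha}_\infty \), then \( \mathbf{X} \mathbf{X}^\top \boldsymbol{\alpha}_\infty = \mathbf{y} \), which gives

\( \mathbf{w}_\infty = \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top)^{-1} \mathbf{y} = \mathbf{w}_{\text{min-norm}} \)

A nonzero initialization \( \mathbf{w}_0 \) would instead leave its null-space component untouched, and the limit would be the interpolating solution closest to \( \mathbf{w}_0 \) rather than to the origin.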

This reframes the concept of inductive bias. In classical learning theory, inductive bias is the set of assumptions that constrain the hypothesis space before seeing data. In the overparameterized regime, the hypothesis space is effectively unconstrained with respect to the training set (any labeling of the training points can be fit), and the inductive bias is encoded not in the model architecture but in the optimization dynamics. In the linear least-squares setting, the minimum norm property is the implicit inductive bias of gradient descent: it tells the system which solution to prefer when the data alone cannot decide.

The minimum norm solution is the signature of a learning system that has surrendered control over its hypothesis space but retained control over its dynamics. Classical statistics feared overparameterization because it imagined the hypothesis space as the only constraint. Modern machine learning discovered that dynamics can constrain just as powerfully, and that the minimum norm is the hidden trace of that constraint. The danger is not that we have too many parameters. The danger is that we do not understand why our dynamics prefer some regions of the solution manifold over others.