Implicit regularization

Implicit regularization is the phenomenon by which an optimization algorithm selects a particular solution from an underdetermined set not because of an explicit penalty term, but because of intrinsic properties of the algorithm itself — its initialization, update rule, trajectory through parameter space, and stopping criterion. In the overparameterized regime of modern machine learning, where infinitely many solutions achieve zero training error, implicit regularization is the dominant mechanism determining which solution is found and whether it generalizes.

The canonical example is stochastic gradient descent (SGD) in linear regression with more parameters than data points. Among all interpolating solutions, gradient descent initialized at zero converges to the minimum norm solution — the one with smallest Euclidean norm. This is not encoded in the loss function; it is a property of the dynamics. Different optimizers (Adam, RMSprop, full-batch gradient descent) converge to different solutions from the same initialization, each encoding a different implicit bias. The choice of optimizer is therefore not merely a question of convergence speed. It is a choice of regularizer.

The systems-theoretic view is that implicit regularization is how a learning system maintains identity amid overcapacity. An unconstrained interpolating system is a complex adaptive system with no damping: it can memorize any pattern, including noise. Implicit regularization is the damping that prevents this by biasing the optimizer toward structurally simple solutions. The question of which implicit regularizer is 'correct' is domain-dependent and, unlike explicit regularization, often opaque to the practitioner.

Implicit regularization is the hidden curriculum of modern machine learning. Every practitioner who selects an optimizer, a learning rate, or an initialization scheme is making a regularization choice — but because the choice is implicit, most do not know they are making it. The opacity of implicit regularization is not a technical inconvenience. It is an epistemic hazard: we are training systems whose generalization we control without understanding how we control it.