Margin theory

Margin theory is the branch of statistical learning theory that explains why classifiers with larger margins generalize better, even when they have more parameters than data points. The central claim is counterintuitive: what matters for generalization is not the number of parameters but the margin, that is, the distance between the classifier's decision boundary and the nearest training examples. A wide margin implies robustness; a narrow margin implies fragility.

The foundational result, proved by Vapnik and Chervonenkis, bounds the generalization error in terms of the margin and the radius of the smallest ball containing the data. Roughly: if the data fits inside a ball of radius R and the classifier achieves margin γ, then the sample complexity scales as (R/γ)², with no explicit dependence on the number of dimensions. This means a large-margin classifier in high dimensions may need fewer samples than a small-margin classifier in low dimensions. Dimensionality is not the enemy; narrow margins are.
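
To make the quantities concrete, the following is a minimal sketch that computes γ, R, and the ratio (R/γ)² for a linear classifier on a hand-made data set. The function name `margin_and_radius`, the toy points, and the choice to center the ball at the origin are illustrative assumptions, not part of the original result.

```python
import numpy as np

def margin_and_radius(X, y, w, b=0.0):
    """Return (gamma, R) for the linear classifier sign(w.x + b).

    gamma: smallest signed distance from a training point to the hyperplane
           (positive only if every point is classified correctly).
    R:     radius of the origin-centered ball containing the data
           (centering at the origin is an assumption of this sketch).
    """
    w = np.asarray(w, dtype=float)
    signed_dist = y * (X @ w + b) / np.linalg.norm(w)
    gamma = signed_dist.min()
    R = np.linalg.norm(X, axis=1).max()
    return gamma, R

# Toy, linearly separable data (hypothetical values chosen for illustration).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])  # separating direction along the first axis

gamma, R = margin_and_radius(X, y, w)
complexity_ratio = (R / gamma) ** 2
print(gamma, R, complexity_ratio)
```

Padding every point with extra zero coordinates would raise the dimension but leave both γ and R unchanged, which is the sense in which the ratio does not depend on dimensionality.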

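For decades, margin theory explained the success of support vector machines, which maximize margin by design. When deep learning surpassed SVMs, the theory seemed obsolete, since neural networks do not explicitly maximize margin. Yet recent work has shown that gradient descent on overparameterized models implicitly favors large-margin solutions. In the best-understood case, a linear model trained with logistic loss on separable data, the optimizer does not settle on an arbitrary solution: its weight vector converges in direction to the maximum-margin separator, which is the smallest-norm solution that fits the data with unit margin. This is the same geometric preference the SVM encodes explicitly, and related but weaker results hold for some classes of deep networks.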
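
The implicit preference can be observed in the simplest setting. The sketch below runs plain gradient descent on the logistic loss for a linear classifier through the origin and compares the normalized margin after a few steps with the margin after many steps. The helper names `normalized_margin` and `logistic_gd`, the step size, the step counts, and the toy data are all assumptions made for illustration; this is not a claim about how any particular network is trained.

```python
import numpy as np

def normalized_margin(X, y, w):
    """Smallest signed distance from a training point to the hyperplane w.x = 0."""
    return (y * (X @ w)).min() / np.linalg.norm(w)

def logistic_gd(X, y, steps, lr=0.1):
    """Plain gradient descent on the mean logistic loss, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = y * (X @ w)              # unnormalized margins y_i * (w . x_i)
        s = 1.0 / (1.0 + np.exp(z))  # sigmoid(-z): weight of each example in the gradient
        grad = -(X * (y * s)[:, None]).mean(axis=0)
        w -= lr * grad
    return w

# Toy, linearly separable data (hypothetical values chosen for illustration).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

margin_early = normalized_margin(X, y, logistic_gd(X, y, steps=10))
margin_late = normalized_margin(X, y, logistic_gd(X, y, steps=10_000))
print(margin_early, margin_late)
```

For this data set the largest margin any hyperplane through the origin can achieve is 2, attained along the direction (1, 0). No margin term appears in the loss, yet `margin_late` should exceed `margin_early` and approach 2 from below. The approach is slow, because the direction converges only logarithmically in the number of steps.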

This convergence suggests that large-margin behavior is not a special property of kernel methods but a general feature of high-dimensional learning. The implicit regularization of gradient descent, the double descent phenomenon, and the benign overfitting of interpolating classifiers all find partial explanations in margin geometry. The theory is incomplete but directionally correct: in high dimensions, distance is structure.