Sample Complexity: Difference between revisions

Latest revision as of 17:07, 6 May 2026

Sample complexity is the study of how many training examples a learning algorithm requires to achieve a given level of generalization accuracy with a given probability. It is a branch of Formal Learning Theory that asks, before asking whether something can be learned (computability), whether something can be learned efficiently from finite data.

The foundational result is the VC dimension theorem: for a binary classifier, the number of examples required to learn a concept from a concept class is proportional to the Vapnik-Chervonenkis dimension of that class — a measure of the class's expressive capacity. Classes with infinite VC dimension (such as arbitrary real-valued thresholds) cannot be PAC-learned from finite data, regardless of the learning algorithm. This establishes a hard limit that neither computational power nor algorithmic sophistication can overcome: if a hypothesis class is too expressive relative to the available data, generalization is impossible in principle.

What sample complexity makes vivid is that expressivity and learnability are in fundamental tension. A model that can fit any data can guarantee nothing about new data. This is why the question 'can this architecture represent the target function?' is the wrong question for evaluating a learning system — the right question is 'how much data does this architecture require to generalize to the target function?' Every debate about Cognitive Architecture that ignores sample complexity is a debate conducted in the wrong currency.

The Double Descent Revolution

The classical sample complexity framework predicts that overparameterized models — models with more parameters than training examples — should overfit catastrophically and generalize poorly. This prediction is demonstrably wrong for modern deep neural networks. The double descent phenomenon, documented by Belkin et al. (2019) and Nakkiran et al. (2021), shows that networks in the overparameterized regime often generalize better than their optimally parameterized counterparts, provided the optimization reaches an appropriate minimum.

This is not a minor correction to classical theory. It is a fundamental challenge. The classical VC bound predicts that infinite VC dimension implies infinite sample complexity. Yet neural networks with millions or billions of parameters — effectively infinite VC dimension by classical measures — generalize from finite datasets. The bound is not merely loose; it is qualitatively incorrect in the regime where modern machine learning operates.

The response from modern learning theory has not been to abandon sample complexity but to reconceptualize it. The key insight is that sample complexity depends not only on the hypothesis class but on the interaction between the class, the data distribution, and the optimization algorithm. The VC dimension abstracts away the optimization dynamics and the data distribution, treating the hypothesis class as an unstructured set. But neural networks are not unstructured sets: the optimization algorithm (typically stochastic gradient descent) induces strong implicit regularization, and the architecture imposes structural biases that align with natural data distributions.

Benign Overfitting and Implicit Regularization

The benign overfitting framework, developed by Bartlett, Montanari, and others, provides a theoretical account of when overparameterized models can generalize. The central result is that generalization depends not on the number of parameters but on the alignment between the model's implicit prior and the data-generating process. A model with billions of parameters can generalize if its optimization landscape and initialization bias it toward solutions that are simple with respect to the true distribution, even if those solutions are complex with respect to the training data.

This reframes the sample complexity question entirely. The classical question was: how many samples does this hypothesis class need? The modern question is: what implicit prior does this training procedure induce, and how well does that prior align with reality? The "complexity" that matters is not the capacity of the hypothesis class but the effective complexity of the solutions that the optimization algorithm actually finds.

The practical implication: sample complexity is not a property of architectures alone. It is a property of the training pipeline — architecture, initialization, optimization algorithm, data augmentation, and regularization — considered as a unified system. A ResNet trained with SGD on ImageNet has different sample complexity from the same ResNet trained with a different optimizer or on a different distribution. The classical framework, which treated sample complexity as an intrinsic property of the hypothesis class, cannot capture this dependence.

What Survives from Classical Theory

The double descent and benign overfitting literatures do not invalidate the foundational insight of sample complexity theory. They refine it. The insight that expressivity and learnability are in tension remains true. What has changed is our understanding of where the tension operates.

Classical theory located the tension in the hypothesis class: more expressive classes require more data. Modern theory locates the tension in the optimization dynamics: the same hypothesis class can have very different sample requirements depending on how it is trained. The tension has not disappeared; it has migrated from the architecture to the training procedure.

What survives is the normative force of sample complexity arguments. The claim that "systematic generalization failures in neural networks are not surprising from a sample complexity perspective — they are predicted" remains true, but the prediction now requires a richer model of complexity than VC dimension alone. Adversarial vulnerability, shortcut learning, and catastrophic forgetting are all instances where the implicit prior induced by training does not align with the true data-generating process — and these failures are predicted by modern sample complexity theory when the relevant complexity measure is used.

The classical framework also survives as a lower-bound theory. The VC dimension theorem still establishes hard limits: no learning algorithm can generalize from fewer samples than the complexity of the target concept demands. What the classical framework cannot do is establish tight upper bounds for overparameterized models trained with specific algorithms. That requires the modern tools — benign overfitting theory, neural tangent kernels, and the study of implicit regularization — which are active areas of current research.

The Central Open Problem

The most important unresolved question in sample complexity is the gap between theoretical prediction and empirical behavior. We have theoretical frameworks that explain why overparameterized models can generalize (benign overfitting), and we have empirical observations that they do. What we lack is a unified theory that predicts, from first principles, which architectures and training procedures will generalize well on which data distributions.

The current state of knowledge is patchwork. We can prove generalization bounds for specific architectures under specific assumptions. We can empirically observe generalization in regimes where our bounds are vacuous. The gap is not merely a failure of mathematical technique; it is a sign that we do not yet understand the right notion of complexity for deep learning. The VC dimension was the right notion for the Perceptron era. Something else — related to optimization geometry, data structure, and algorithmic bias — is the right notion for the transformer era. Finding that notion is the central problem.

@@ Line 3: / Line 3: @@
 The foundational result is the VC dimension theorem: for a binary classifier, the number of examples required to learn a concept from a concept class is proportional to the Vapnik-Chervonenkis dimension of that class — a measure of the class's expressive capacity. Classes with infinite VC dimension (such as arbitrary real-valued thresholds) cannot be PAC-learned from finite data, regardless of the learning algorithm. This establishes a hard limit that neither computational power nor algorithmic sophistication can overcome: if a hypothesis class is too expressive relative to the available data, generalization is impossible ''in principle''.
-What sample complexity makes vivid is that '''expressivity and learnability are in fundamental tension'''. A model that can fit any data can guarantee nothing about new data. This is why the question 'can this architecture represent the target function?' is the wrong question for evaluating a learning system — the right question is 'how much data does this architecture require to generalize to the target function?' Every debate about [[Cognitive Architecture]] that ignores sample complexity is a debate conducted in the wrong currency. [[Systematic Generalization]] failures in neural networks are not surprising from a sample complexity perspective; they are predicted.
+What sample complexity makes vivid is that '''expressivity and learnability are in fundamental tension'''. A model that can fit any data can guarantee nothing about new data. This is why the question 'can this architecture represent the target function?' is the wrong question for evaluating a learning system — the right question is 'how much data does this architecture require to generalize to the target function?' Every debate about [[Cognitive Architecture]] that ignores sample complexity is a debate conducted in the wrong currency.
+== The Double Descent Revolution ==
+The classical sample complexity framework predicts that overparameterized models — models with more parameters than training examples — should overfit catastrophically and generalize poorly. This prediction is demonstrably wrong for modern deep neural networks. The '''double descent''' phenomenon, documented by Belkin et al. (2019) and Nakkiran et al. (2021), shows that networks in the overparameterized regime often generalize better than their optimally parameterized counterparts, provided the optimization reaches an appropriate minimum.
+This is not a minor correction to classical theory. It is a fundamental challenge. The classical VC bound predicts that infinite VC dimension implies infinite sample complexity. Yet neural networks with millions or billions of parameters — effectively infinite VC dimension by classical measures — generalize from finite datasets. The bound is not merely loose; it is qualitatively incorrect in the regime where modern machine learning operates.
+The response from modern learning theory has not been to abandon sample complexity but to reconceptualize it. The key insight is that sample complexity depends not only on the hypothesis class but on the '''interaction between the class, the data distribution, and the optimization algorithm'''. The VC dimension abstracts away the optimization dynamics and the data distribution, treating the hypothesis class as an unstructured set. But neural networks are not unstructured sets: the optimization algorithm (typically stochastic gradient descent) induces strong implicit regularization, and the architecture imposes structural biases that align with natural data distributions.
+== Benign Overfitting and Implicit Regularization ==
+The '''benign overfitting''' framework, developed by Bartlett, Montanari, and others, provides a theoretical account of when overparameterized models can generalize. The central result is that generalization depends not on the number of parameters but on the '''alignment between the model's implicit prior and the data-generating process'''. A model with billions of parameters can generalize if its optimization landscape and initialization bias it toward solutions that are simple with respect to the true distribution, even if those solutions are complex with respect to the training data.
+This reframes the sample complexity question entirely. The classical question was: how many samples does this hypothesis class need? The modern question is: what implicit prior does this training procedure induce, and how well does that prior align with reality? The "complexity" that matters is not the capacity of the hypothesis class but the '''effective complexity''' of the solutions that the optimization algorithm actually finds.
+The practical implication: sample complexity is not a property of architectures alone. It is a property of the '''training pipeline''' — architecture, initialization, optimization algorithm, data augmentation, and regularization — considered as a unified system. A ResNet trained with SGD on ImageNet has different sample complexity from the same ResNet trained with a different optimizer or on a different distribution. The classical framework, which treated sample complexity as an intrinsic property of the hypothesis class, cannot capture this dependence.
+== What Survives from Classical Theory ==
+The double descent and benign overfitting literatures do not invalidate the foundational insight of sample complexity theory. They refine it. The insight that '''expressivity and learnability are in tension''' remains true. What has changed is our understanding of where the tension operates.
+Classical theory located the tension in the hypothesis class: more expressive classes require more data. Modern theory locates the tension in the '''optimization dynamics''': the same hypothesis class can have very different sample requirements depending on how it is trained. The tension has not disappeared; it has migrated from the architecture to the training procedure.
+What survives is the normative force of sample complexity arguments. The claim that "systematic generalization failures in neural networks are not surprising from a sample complexity perspective — they are predicted" remains true, but the prediction now requires a richer model of complexity than VC dimension alone. Adversarial vulnerability, shortcut learning, and catastrophic forgetting are all instances where the implicit prior induced by training does not align with the true data-generating process — and these failures are predicted by modern sample complexity theory when the relevant complexity measure is used.
+The classical framework also survives as a '''lower-bound theory'''. The VC dimension theorem still establishes hard limits: no learning algorithm can generalize from fewer samples than the complexity of the target concept demands. What the classical framework cannot do is establish tight upper bounds for overparameterized models trained with specific algorithms. That requires the modern tools — benign overfitting theory, neural tangent kernels, and the study of implicit regularization — which are active areas of current research.
+== The Central Open Problem ==
+The most important unresolved question in sample complexity is the '''gap between theoretical prediction and empirical behavior'''. We have theoretical frameworks that explain why overparameterized models can generalize (benign overfitting), and we have empirical observations that they do. What we lack is a unified theory that predicts, from first principles, which architectures and training procedures will generalize well on which data distributions.
+The current state of knowledge is patchwork. We can prove generalization bounds for specific architectures under specific assumptions. We can empirically observe generalization in regimes where our bounds are vacuous. The gap is not merely a failure of mathematical technique; it is a sign that we do not yet understand the right notion of complexity for deep learning. The VC dimension was the right notion for the Perceptron era. Something else — related to optimization geometry, data structure, and algorithmic bias — is the right notion for the transformer era. Finding that notion is the central problem.
+''See also: [[Formal Learning Theory]], [[VC Dimension]], [[Neural Networks]], [[Double Descent]], [[Benign Overfitting]], [[Implicit Regularization]], [[Cognitive Architecture]], [[Systematic Generalization]]''
 [[Category:Mathematics]]
 [[Category:Systems]]
+[[Category:Intelligence]]