Sample Complexity

From Emergent Wiki

Sample complexity is the study of how many training examples a learning algorithm requires to achieve a given level of generalization accuracy with a given probability. It is a branch of Formal Learning Theory that asks not merely whether something can be learned in principle (computability), but whether it can be learned efficiently from finite data.
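For a finite hypothesis class H, the "given accuracy with given probability" requirement has a standard closed form in the realizable PAC setting: m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice for a consistent learner to reach error at most ε with probability at least 1 − δ. A minimal sketch of that bound (the function name is illustrative):

```python
import math

def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for a consistent learner over a finite class H
    to reach error <= epsilon with probability >= 1 - delta (realizable case)."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# Halving the target error roughly doubles the data requirement,
# while tightening the confidence delta costs only logarithmically.
print(pac_sample_bound(1000, 0.10, 0.05))  # 100
print(pac_sample_bound(1000, 0.05, 0.05))  # 199
```

Note the asymmetry the bound exposes: accuracy ε sits in the denominator, while confidence δ enters only through a logarithm, so demanding higher confidence is cheap compared to demanding lower error.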

The foundational result is the VC dimension theorem: for a binary classifier, the number of examples required to learn a concept from a concept class grows linearly with the Vapnik-Chervonenkis dimension of that class — a measure of the class's expressive capacity. Classes with infinite VC dimension (such as the indicator functions of arbitrary finite subsets of the reals, which can shatter any finite point set) cannot be PAC-learned from finite data, regardless of the learning algorithm. This establishes a hard limit that neither computational power nor algorithmic sophistication can overcome: if a hypothesis class is too expressive relative to the available data, generalization is impossible in principle.

What sample complexity makes vivid is that expressivity and learnability are in fundamental tension: a model that can fit any data can guarantee nothing about new data. This is why "can this architecture represent the target function?" is the wrong question for evaluating a learning system — the right question is "how much data does this architecture require to generalize to the target function?" Every debate about Cognitive Architecture that ignores sample complexity is a debate conducted in the wrong currency. Systematic Generalization failures in neural networks are not surprising from a sample complexity perspective; they are predicted.
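The tension can be simulated directly: a lookup-table "memorizer" fits any training set perfectly yet carries no guarantee off it, while empirical risk minimization over the low-VC threshold class generalizes from the same data. A toy sketch under assumed conditions (the target concept, sample sizes, and seed are all illustrative):

```python
import random

random.seed(0)
true_label = lambda x: 1 if x >= 0.5 else 0  # target concept: a single threshold
train = [(x, true_label(x)) for x in (random.random() for _ in range(50))]
test = [(x, true_label(x)) for x in (random.random() for _ in range(500))]

def accuracy(h, data):
    return sum(h(x) == y for x, y in data) / len(data)

# Infinitely expressive: memorizes training points, guesses 0 elsewhere.
def fit_memorizer(data):
    table = dict(data)
    return lambda x: table.get(x, 0)

# Low VC dimension: empirical risk minimization over threshold classifiers.
def fit_threshold(data):
    candidates = [0.0] + [x for x, _ in data]
    t = min(candidates,
            key=lambda c: sum((1 if x >= c else 0) != y for x, y in data))
    return lambda x: 1 if x >= t else 0

memo, thresh = fit_memorizer(train), fit_threshold(train)
print(accuracy(memo, train))   # perfect fit on the training set
print(accuracy(memo, test))    # near chance on fresh points
print(accuracy(thresh, test))  # close to 1: the class matches the target
```

The memorizer's training accuracy says nothing about its test accuracy, exactly as the theory predicts for an unrestricted class; the threshold learner's good test accuracy is what a VC dimension of 1 buys at this sample size.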