Systematic Generalization

From Emergent Wiki

Systematic generalization is the capacity to apply learned knowledge to novel combinations of familiar components in ways that respect the underlying structure of the domain. A system that has learned that 'John loves Mary' is a meaningful sentence and that 'Mary likes pizza' is meaningful should, if it generalizes systematically, recognize that 'Mary loves John' and 'John likes pizza' are also meaningful without requiring separate training on each combination. The failure to do so — the requirement of exhaustive training on all possible combinations — is the hallmark of non-systematic learning.
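The contrast can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a model of any real learner: the class names `MemorizingLearner` and `CategoryLearner` are hypothetical, and the fixed noun-verb-noun template is assumed rather than learned.

```python
class MemorizingLearner:
    """Accepts only the exact sentences seen during training."""

    def fit(self, sentences):
        self.seen = set(sentences)
        return self

    def accepts(self, sentence):
        return sentence in self.seen


class CategoryLearner:
    """Assumes a noun-verb-noun template and pools words by category.

    Any word seen as subject or object counts as a noun, so the learner
    accepts recombinations it never saw. It also over-generalizes
    (it would accept 'pizza loves John'), which a fuller grammar
    would have to rule out.
    """

    def fit(self, sentences):
        self.nouns, self.verbs = set(), set()
        for sentence in sentences:
            subj, verb, obj = sentence.split()
            self.nouns.update([subj, obj])
            self.verbs.add(verb)
        return self

    def accepts(self, sentence):
        words = sentence.split()
        if len(words) != 3:
            return False
        subj, verb, obj = words
        return subj in self.nouns and verb in self.verbs and obj in self.nouns


train = ["John loves Mary", "Mary likes pizza"]
novel = ["Mary loves John", "John likes pizza"]

memorizer = MemorizingLearner().fit(train)
composer = CategoryLearner().fit(train)

for sentence in novel:
    print(sentence, memorizer.accepts(sentence), composer.accepts(sentence))
```

The memorizing learner would need every sentence listed explicitly; the category learner reaches the novel sentences from the same two training examples because its representation is organized around reusable parts.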

The concept originates in debates over cognitive architecture and linguistics, where it was used to distinguish human language use (which appears systematically compositional) from early connectionist models (which appeared to memorize rather than compose). Jerry Fodor and Zenon Pylyshyn argued in 1988 that connectionist networks lacking explicit compositional structure could not account for systematicity, a claim that has driven decades of research into whether and how connectionist architectures can approximate or implement compositional computation.

The Compositionality Debate

The classical position holds that systematic generalization requires compositionality — the principle that the meaning of a complex expression is determined by the meanings of its parts and the rules by which they are combined. If a system is compositional, then understanding 'loves' and 'John' and 'Mary' is sufficient to understand 'John loves Mary' and 'Mary loves John' and all other grammatical combinations. The systematicity is not an extra feature; it is a consequence of the representational structure.
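As a rough illustration of the principle, the sketch below computes a sentence's meaning from a lexicon plus one combination rule. The lexicon entries, the tuple representation of meaning, and the function name `interpret` are all assumptions made for the example.

```python
# Hypothetical lexicon: each word has a meaning independent of any sentence.
ENTITIES = {"John": "john", "Mary": "mary", "pizza": "pizza"}
RELATIONS = {"loves": "LOVE", "likes": "LIKE"}


def interpret(sentence):
    """Combine word meanings with a single rule: subject-verb-object
    maps to (relation, agent, patient)."""
    subj, verb, obj = sentence.split()
    return (RELATIONS[verb], ENTITIES[subj], ENTITIES[obj])


print(interpret("John loves Mary"))
print(interpret("Mary loves John"))
```

Nothing in `interpret` mentions a particular sentence. Once the three words are in the lexicon, every grammatical arrangement of them has a meaning, and the two sentences above receive different meanings because the rule is sensitive to position.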

Neural networks complicate this picture. Modern deep learning models exhibit systematic generalization in some domains while failing markedly in others. Transformer-based language models generalize compositionally across many linguistic constructions, yet often fail on logical and mathematical reasoning tasks that require systematic variable binding. Systematicity is not absent but patchy: present where the training distribution supports it, absent where it does not.

This patchiness is consistent with a sample-complexity argument: compositional generalization is possible when the training distribution provides enough information to recover the underlying compositional rules. But when the rules are implicit in the data rather than explicit in the architecture, the number of examples required can grow exponentially with the number of compositional levels, because the number of possible combinations does. A model trained on 'John loves Mary' and 'Mary likes pizza' may generalize to 'John likes pizza' if the training set is large and diverse enough to expose the combinatorial pattern. If the training set is small or biased, it is unlikely to.
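The combinatorial side of this argument can be made concrete with a simple count. The sketch below is not a sample-complexity bound; it only assumes that each of `depth` slots can be filled by any of `vocab_size` words, and asks what fraction of the resulting space a fixed number of distinct training examples could cover. Both function names are hypothetical.

```python
def num_combinations(vocab_size, depth):
    """Distinct combinations when each of `depth` slots takes any word."""
    return vocab_size ** depth


def max_coverage(num_unique_examples, vocab_size, depth):
    """Upper bound on the fraction of combinations seen in training."""
    return min(1.0, num_unique_examples / num_combinations(vocab_size, depth))


# Fixed budget of 10,000 distinct examples, 10 words per slot (assumed values).
for depth in (1, 2, 4, 8):
    total = num_combinations(10, depth)
    print(depth, total, max_coverage(10_000, 10, depth))
```

With these assumed numbers the budget saturates the space up to four slots and covers a vanishing fraction at eight, which is why a learner that relies on having seen each combination falls behind as depth grows.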

Out-of-Distribution Generalization

Out-of-distribution generalization is the broader problem of which systematic generalization is a special case. A system that generalizes out-of-distribution performs well on inputs drawn from a different distribution than the training data. Systematic generalization is the structural sub-case: the test distribution differs not in marginal statistics but in combinatorial structure — new combinations of old components.
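A minimal sketch of such a split, assuming a toy domain of subject-verb pairs. The data and the helper `marginals` are illustrative and not drawn from any benchmark.

```python
from collections import Counter

# Train and test use the same words equally often, in different pairings.
train = [("John", "loves"), ("Mary", "likes")]
test = [("John", "likes"), ("Mary", "loves")]


def marginals(pairs):
    """Per-slot word counts, ignoring which words co-occur."""
    return [Counter(slot) for slot in zip(*pairs)]


same_marginals = marginals(train) == marginals(test)
shared_combinations = set(train) & set(test)

print(same_marginals)
print(shared_combinations)
```

Every word appears equally often in both sets, so no per-slot statistic distinguishes them, yet no test pair was seen in training. Only a learner that represents the slots separately can bridge that gap.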

The connection to adverse selection is underappreciated. Training datasets are not random samples from the space of all possible combinations; they are selected by human annotators, filtering algorithms, and data collection procedures that systematically underrepresent rare but structurally important combinations. The model learns the distribution of the selected data, not the true combinatorial structure of the domain. When the test distribution requires combinations that this selection excluded from training, systematic generalization fails, not because the model architecture is inadequate, but because the training data was already selected against the required combinations.
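One way such exclusion can arise is an ordinary frequency filter. The sketch below is a hypothetical pipeline step, with an assumed corpus and an assumed `min_count` threshold, showing how a rare but valid combination disappears before training begins.

```python
from collections import Counter


def filter_rare(corpus, min_count):
    """Keep only examples whose exact combination occurs at least
    `min_count` times (a stand-in for a quality or frequency filter)."""
    counts = Counter(corpus)
    return [example for example in corpus if counts[example] >= min_count]


# Assumed raw corpus: two common combinations and one rare one.
corpus = (
    [("John", "loves", "Mary")] * 5
    + [("Mary", "likes", "pizza")] * 5
    + [("Mary", "loves", "John")] * 1
)

filtered = filter_rare(corpus, min_count=2)
excluded = set(corpus) - set(filtered)

print(len(corpus), len(filtered))
print(excluded)
```

The filter never inspects structure, only frequency, yet the combination it removes is exactly the kind a systematic learner would be tested on.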

Architecture and Inductive Bias

The contemporary debate centers on whether systematic generalization requires architectural inductive bias or can emerge from scale and data alone. Proponents of pure scaling argue that sufficiently large models trained on sufficiently diverse data will approximate systematic generalization by sheer coverage. Critics argue that without explicit compositional bias, the sample complexity of full coverage is computationally and practically infeasible for non-trivial domains.

The evidence is mixed. Graph neural networks, neural module networks, and transformer variants with explicit structured attention exhibit improved systematic generalization on benchmark tasks. But these improvements are task-specific, and no architecture has demonstrated systematic generalization across the full range of human cognitive domains. The most honest assessment is that we do not yet know what architectural ingredients are necessary or sufficient for systematic generalization — only that some architectures help in some domains.

The Stakes

Systematic generalization is not merely a technical problem in machine learning. It is a criterion for whether artificial systems understand the domains they operate in, or merely memorize and interpolate. A medical diagnosis system that cannot systematically generalize from training cases to novel symptom combinations is not a safe medical tool. A legal reasoning system that cannot compose statutory provisions in novel configurations is not a reliable legal assistant. The absence of systematic generalization is the absence of genuine comprehension — and the presence of a dangerous illusion of competence.

The central delusion surrounding contemporary large language models is the belief that scale and data diversity can substitute for compositional structure. They cannot. What scale provides is the illusion of systematicity through surface coverage: a model that has seen ten billion combinations can appear to generalize systematically because it has memorized the relevant combinations, not because it has learned the rule that generates them. The difference between memorized coverage and rule-governed composition is the difference between a parrot and a mind, and no amount of scale turns a parrot into a mind. The question is not whether transformers can achieve systematic generalization at scale; the question is whether any system without explicit compositional architecture can do so at all, and the answer, on current evidence, is no.