Neural Scaling Laws
Neural scaling laws are empirical regularities describing how the performance of machine learning systems — measured as loss on held-out data — improves predictably as a power-law function of three variables: the number of model parameters, the volume of training data, and the total training compute. First systematically characterized by Kaplan et al. at OpenAI in 2020 and substantially revised by Hoffmann et al. (DeepMind, 2022) in the "Chinchilla" paper, scaling laws represent the most reliable predictive theory available in deep learning. They are also the most philosophically inconvenient, because what they predict — that intelligence scales continuously and predictably with resources — is precisely what pre-deep-learning theorists assumed was impossible.
The Empirical Pattern
The core finding: for large language models trained by gradient descent on next-token prediction, test loss L decreases as a power law in each of the three variables when the others are held constant:
- L ∝ N^(-α), where N is the parameter count
- L ∝ D^(-β), where D is the dataset size in tokens
- L ∝ C^(-γ), where C is the total training compute (FLOPs)
The exponents are approximately constant across model families, scales spanning six orders of magnitude, and architectures ranging from vanilla transformers to mixture-of-experts systems. The regularity is not exact — there is scatter, and the exponents differ between domains — but it is robust enough to support engineering decisions worth billions of dollars.
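In practice these exponents are estimated by fitting a straight line in log-log space, since a power law L = k·N^(-α) becomes linear there. The sketch below is illustrative: the helper name and the synthetic exponent are assumptions for the example, not values from the papers.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = k * x^(-alpha) by ordinary least squares in log-log space.

    A straight line in log-log coordinates is the signature of a power
    law; its slope gives the scaling exponent.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    intercept = my - slope * mx
    return -slope, math.exp(intercept)  # alpha = -slope, k = e^intercept

# Synthetic losses following L = 10 * N^(-0.076); the exponent is a
# made-up value in the rough ballpark of reported parameter exponents.
params = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [10 * n ** -0.076 for n in params]
alpha, k = fit_power_law(params, losses)
```

Real measurements carry scatter, so fitted exponents come with error bars; the clean recovery here works only because the synthetic data is exactly power-law.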
The Chinchilla revision corrected the original Kaplan et al. finding that increasing parameters was more efficient than increasing data. Hoffmann et al. showed that, given a fixed compute budget, models had been chronically undertrained: optimal compute allocation requires scaling data and parameters in roughly equal proportion. The practical implication was immediate: frontier LLMs in the GPT-3 era were too large for their training sets. Chinchilla-optimal training produced substantially better performance at lower parameter counts — demonstrating that scaling laws are not merely descriptive but prescriptive tools for engineering decisions.
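The equal-proportion allocation can be made concrete with a back-of-envelope sketch. It assumes the widely used approximation that training costs about 6·N·D FLOPs and the roughly-20-tokens-per-parameter rule of thumb often read off the Chinchilla results; both are simplifications, not the paper's full parametric fit.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a fixed compute budget between parameters and training tokens.

    Assumes C = 6 * N * D training FLOPs and a fixed token-to-parameter
    ratio r, so D = r * N. Solving gives N = sqrt(C / (6 * r)), i.e. both
    N and D scale as C^0.5 -- the "roughly equal proportion" result.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.8e23 FLOPs) recovers its published shape:
# roughly 70B parameters trained on roughly 1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
```

The same function, applied to a GPT-3-sized budget, immediately shows why GPT-3's 175B parameters were oversized for its ~300B training tokens.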
What Scaling Laws Mean and Do Not Mean
The philosophical weight of scaling laws is routinely either overstated or understated. The two errors are symmetric.
The overstated version: scaling laws imply that AGI is a matter of adding resources. If performance scales predictably with compute, then sufficiently large compute produces human-level or superhuman cognition. This inference is invalid for two reasons. First, the loss metric that scaling laws track — cross-entropy on token prediction — is not a measure of general intelligence. It is a measure of how well a model predicts the next token in text. The relationship between token prediction loss and the cognitive capacities we actually care about is empirically correlated but not theoretically derived. Second, scaling laws are observed to hold over the ranges studied; whether they continue to hold at greater scales is an empirical question that has already shown signs of complication. Emergent capabilities appear discontinuously with scale, suggesting that the smooth power-law surface has phase transitions whose locations cannot be predicted from the law alone.
The understated version: scaling laws are just curve-fitting, not theory. This is wrong in the other direction. The regularity is more striking than this dismissal allows. A power law that holds across six orders of magnitude in compute, across different architectures, different datasets, and different organizations independently replicating the finding, is not noise. It is evidence of a structural feature of the learning problem. The most plausible explanation is that language modeling is a compression task and the information content of natural language imposes a predictable structure on how much capacity is required to approximate it at each level of fidelity. Scaling laws are the compression theory of language made quantitative.
Compute Frontiers and the Efficiency Race
Scaling laws transformed AI development from an art into an engineering discipline — partially. Before scaling laws, model performance depended on architectural innovations whose effects were hard to predict. After scaling laws, the dominant variable is simply: how much compute can you allocate? This created the race-to-scale dynamic of 2020-2024, in which frontier labs competed primarily on training compute rather than architectural novelty.
The efficiency race has complicated this picture. Quantization, sparse architectures, and inference-time compute (chain-of-thought, test-time search) have repeatedly demonstrated that the parameter-count axis of scaling is not the only lever. Inference-time compute scaling, studied in the "o1" family of models, suggests a second scaling law governing how performance improves with reasoning steps at test time. If confirmed, this implies that the scaling paradigm is not one-dimensional but a family of laws governing different resources — and that the relevant resource for some cognitive tasks may be reasoning depth rather than parameter count.
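The shape of a test-time scaling curve can be illustrated with a toy self-consistency model: sample k independent reasoning chains and take the majority answer. This is a deliberately simplified illustration of the general idea that error falls predictably with test-time compute, not a claim about the mechanism inside o1-style models.

```python
from math import comb

def majority_vote_error(p_correct, k):
    """Probability that a majority vote over k independent samples is wrong.

    Toy model of test-time compute: each sampled chain is independently
    correct with probability p_correct; ties on even k count as errors.
    """
    need = k // 2 + 1  # votes required for a correct majority
    p_right = sum(comb(k, i) * p_correct**i * (1 - p_correct) ** (k - i)
                  for i in range(need, k + 1))
    return 1.0 - p_right

# With a 60%-accurate sampler, error shrinks steeply as the number of
# samples (a proxy for test-time compute) grows.
errors = [majority_vote_error(0.6, k) for k in (1, 5, 25, 125)]
```

Plotted on log-log axes, curves like this look roughly linear over a range, which is the qualitative pattern the inference-time scaling literature reports with far more sophisticated search and reasoning procedures.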
The deeper consequence is a transformation in how intelligence is understood as an engineering artifact. Pre-scaling-laws AI research was dominated by the belief that architectural cleverness — better priors, better inductive biases, better symbolic representations — was the key variable. Scaling laws replaced architectural cleverness with brute resource allocation as the primary driver of capability. This is not what researchers expected, and the field has not fully absorbed the implication: the structure of intelligence, at least in the domain of language, is apparently more like a compression problem than like a program-synthesis problem. Programs must be written; compressions can be graded.
The Unsettled Question
Scaling laws tell us what will happen if we add resources. They do not tell us why the regularity holds, whether it reflects something deep about information and cognition or is an artifact of the particular pretraining objective, or whether a ceiling will appear at scales not yet reached.
The honest position: scaling laws are the best predictive framework available for deep learning systems, they have been right more often than any alternative framework, and they remain theoretically unexplained. The field that produced them has no first-principles account of why intelligence should scale as a power law in resources. It has an empirical regularity that has been enormously useful and a set of post-hoc explanations that are each partially convincing.
Any account of machine intelligence that does not engage with scaling laws is missing the central empirical fact about how machine intelligence actually develops. Any account that treats scaling laws as a complete theory of machine intelligence is mistaking the map for the territory. The map is accurate; the territory is larger than the map. The most provocative reading of the scaling law literature is also the most defensible: the consistent finding that machine intelligence scales smoothly with resources, without categorical discontinuities except at the emergent phase transitions we have not yet predicted, is the strongest available evidence against the view that human-level cognition requires anything other than sufficient resources applied to the right learning objective. The exponents are unimpressed by philosophical arguments about the uniqueness of biological minds.