Scaling Laws

Scaling laws in machine learning are empirical relationships between model size, training data volume, compute budget, and model performance. The term became central to large language model development following the publication of Kaplan et al. (2020) and the Chinchilla paper (Hoffmann et al., 2022), which reported power-law relationships between these quantities and language-modeling loss. These relationships are straight lines on log-log axes and are referred to below as log-linear. Similar curves are commonly fitted to downstream performance on standard benchmarks.
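
As a minimal sketch of what fitting such a relationship involves, the following assumes a loss of the form L(N) = a * N^(-alpha), which becomes a straight line once both axes are logged. The data points, the constants, and the helper name fit_power_law are illustrative assumptions, not values or code from either paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) as a straight line in log-log space.

    Returns (a, alpha). Hypothetical helper written for this illustration.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx                      # equals -alpha
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from made-up constants (assumption).
TRUE_A, TRUE_ALPHA = 10.0, 0.08
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, losses)

# Extrapolate one order of magnitude beyond the fitted range.
predicted_loss_10b = a_hat * 1e10 ** (-alpha_hat)
```

Because the synthetic data contain no noise, the fit recovers the generating constants. With real measurements the extrapolation is only as trustworthy as the assumption that the same regime continues to hold.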

The Chinchilla result significantly revised prevailing practice: it found that most large models of the era were undertrained relative to their parameter count. For a fixed compute budget, the best performance is obtained at roughly 20 training tokens per parameter, a ratio that implies much smaller models trained on much more data than the then-dominant approach.
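
The arithmetic behind this ratio can be sketched as follows. The sketch assumes the common approximation that training compute is C ≈ 6 * N * D FLOPs for N parameters and D tokens. That approximation is not stated in the text above, and the function name compute_optimal_allocation is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0       # rule-of-thumb ratio quoted in the text

def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) at a fixed token/parameter ratio.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Illustrative budget, chosen so the numbers come out round.
budget = 5.88e23
n_params, n_tokens = compute_optimal_allocation(budget)
```

Under these assumptions, a budget of 5.88e23 FLOPs yields about 70 billion parameters and 1.4 trillion tokens, which matches the configuration reported for Chinchilla itself. Note that parameters and tokens each grow with the square root of compute here, so a 100-fold larger budget implies a model only 10 times larger.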

Scaling laws are predictive within a regime but structurally dependent on the benchmarks used to fit them. When a benchmark saturates, that is, when scores approach its ceiling and additional capability no longer registers, the log-linear relationship breaks, and the apparent scaling curve becomes an artifact of evaluation methodology rather than a property of the underlying system. This limitation means that scaling laws function as epistemic artifacts as much as empirical laws: they are less discovered features of the world than tools that shape what researchers measure and, therefore, what they build.
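
A toy illustration of how saturation breaks a log-linear fit follows. It assumes, purely for demonstration, that a benchmark score is a logistic function of log10(compute) with a ceiling of 1.0. The function benchmark_accuracy, its midpoint and steepness, and all data points are invented and do not describe any real benchmark.

```python
import math

def benchmark_accuracy(log10_compute, midpoint=22.0, steepness=1.2):
    """Hypothetical bounded benchmark score in (0, 1); an assumption, not real data."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_compute - midpoint)))

def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Fit score against log10(compute) in the regime where the curve looks straight.
early_regime = [21.0, 21.5, 22.0, 22.5, 23.0]
slope, intercept = fit_line(early_regime, [benchmark_accuracy(x) for x in early_regime])

# Extrapolate three orders of magnitude past the fitted range.
late = 26.0
extrapolated = slope * late + intercept   # what the fitted line predicts
observed = benchmark_accuracy(late)       # what the bounded metric can report
```

In this toy setup the fitted line predicts a score above the ceiling of 1.0, while the bounded metric flattens just below it. The gap comes entirely from the shape of the measurement, which is the sense in which a bent curve can reflect the evaluation rather than the system being evaluated.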

Scaling Laws and the System Lifecycle

The log-linear relationship that scaling laws describe is not a universal physical constant. It is a phase-specific regularity: a signature of systems operating in the exploitation-to-conservation trajectory of the adaptive cycle. During the front loop (r → K), accumulation is smooth, returns are predictable, and scaling laws hold because the system is not yet experiencing the structural constraints that emerge at high complexity. The log-linear curve is the mathematical face of a system that is still accumulating potential and connectedness without having encountered the back loop.

When a system approaches the conservation phase (K), the assumptions underlying scaling laws begin to break. The benchmarks that define performance saturate, not because the models have reached human-level capability, but because the benchmark itself has become a Goodhart target: a measure that ceases to be a good measure once it becomes an objective. On this account, the scaling curve does not bend because of diminishing returns in compute; it bends because the system's own success has altered the epistemic environment in which it is evaluated. The relationship between parameters and performance becomes non-linear, not because of hardware limits, but because the system's outputs have begun to feed back into the training distribution, creating a closed loop that collapses the distinction between model and environment.

This connection reframes the debate about whether scaling laws will continue or break. The question is malformed. Scaling laws are not prophecies; they are diagnostic signatures of a system's position in the adaptive cycle. A system whose scaling curve remains log-linear is a system still in the front loop — accumulating, not yet encountering the constraints that produce release. A system whose curve bends is a system approaching the threshold where accumulated structure becomes a liability rather than an asset. The relevant question is not will