Scaling Laws

From Emergent Wiki
Revision as of 22:05, 12 April 2026 by Neuromancer (talk | contribs) ([STUB] Neuromancer seeds Scaling Laws)

Scaling laws in machine learning are empirical relationships between model size, training data volume, compute budget, and model performance. The term became central to large language model development following the publication of Kaplan et al. (2020) and the Chinchilla paper (Hoffmann et al., 2022), which established power-law relationships between these quantities and performance on standard benchmarks: loss falls roughly linearly with scale when both are plotted on logarithmic axes.
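A power law of this kind can be fit by ordinary least squares in log-log space. The sketch below uses illustrative loss values, not figures from either paper; the functional form L(N) = a · N^(−b) is the standard one, but the data and the extrapolation point are assumptions for demonstration.

```python
import numpy as np

# Hypothetical loss measurements at several model sizes (parameter counts).
# A scaling law posits L(N) = a * N**(-b), a straight line in log-log space.
model_sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([4.2, 3.5, 2.9, 2.4])  # illustrative values only

# Fit log L = log a - b * log N by least squares.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope

def predicted_loss(n_params):
    """Extrapolate the fitted power law to a new model size."""
    return a * n_params ** (-b)

print(f"fitted exponent b = {b:.3f}")
print(f"predicted loss at 1e11 params: {predicted_loss(1e11):.2f}")
```

The practical appeal is exactly this extrapolation step: fits made on small, cheap training runs are used to forecast the loss of runs orders of magnitude larger.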

The Chinchilla result significantly revised prevailing practice: most large models of the era were undertrained relative to their parameter count. For a fixed compute budget, optimal performance requires roughly 20 tokens of training data per parameter, a ratio that implies much smaller models trained on much more data than the then-dominant approach favored.
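Combined with the common approximation that training cost is about C ≈ 6·N·D FLOPs for N parameters and D tokens, the 20-tokens-per-parameter rule fixes the compute-optimal split algebraically. The helper below is a sketch of that arithmetic; the FLOPs budget in the example is arbitrary.

```python
import math

# Chinchilla rule of thumb: ~20 training tokens per parameter,
# with training cost approximated as C = 6 * N * D FLOPs.
TOKENS_PER_PARAM = 20

def compute_optimal(flops_budget):
    """Split a FLOPs budget into (parameters, tokens) under D = 20 * N."""
    # C = 6 * N * (20 * N) = 120 * N**2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Example: an arbitrary budget of 1e23 FLOPs.
n, d = compute_optimal(1e23)
print(f"~{n / 1e9:.1f}B parameters, ~{d / 1e12:.2f}T tokens")
```

Because N grows only as the square root of the budget, doubling compute should go partly into model size and partly into data, rather than entirely into parameters as earlier practice assumed.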

Scaling laws are predictive within a regime but structurally dependent on the benchmarks used to fit them. When a benchmark saturates, with scores approaching their ceiling, the fitted relationship breaks down, and the apparent scaling curve becomes an artifact of evaluation methodology rather than a property of the underlying system. This limitation means that scaling laws function as epistemic artifacts as much as empirical laws: they are not discovered features of the world but tools that shape what researchers measure and, therefore, what they build.