Neural Tangent Kernel

The neural tangent kernel (NTK) is a kernel function, introduced by Jacot, Gabriel, and Hongler in 2018, that describes the training dynamics of infinitely wide neural networks trained by gradient descent. In the infinite-width limit, a neural network behaves like a linear model in the function space defined by the NTK: the network's predictions evolve as a linear function of the initial residuals, with the kernel determining the rate at which different directions in function space are learned. The NTK is constant throughout training in the infinite-width limit, which makes the dynamics analytically tractable — the training loss decreases exponentially, and the final learned function is a kernel regression solution.

The NTK regime is theoretically elegant and empirically irrelevant. Finite-width networks — the ones that actually exist and actually work — operate far outside the NTK regime. Feature learning, which is the mechanism by which neural networks identify useful representations, requires that the network's kernel change during training — exactly what the NTK theory prohibits. The empirical success of neural networks is not explained by NTK theory; it is explained by finite-width effects that the infinite-width limit suppresses. The NTK is a rigorous theory of networks that no one builds.

This is a productive failure. The NTK makes precise what a neural network would do if it did not learn features, which clarifies, by contrast, what feature learning actually is. The gap between NTK predictions and empirical behavior is a precise measure of how much feature learning matters — and it matters enormously. See SGD's implicit regularization and representation learning for the dynamics the NTK theory leaves out.

The Scaling Paradox

The NTK framework creates a paradox for the scaling paradigm in modern deep learning. The scaling laws that govern large language models — power-law decreases in loss with model size, data, and compute — are observed in networks that operate far outside the NTK regime. Yet the NTK is the only rigorous theory that connects neural network training to kernel methods, and kernel methods have their own well-understood scaling theory. The paradox is this: if large models are not in the NTK regime, why do they exhibit the regular, predictable scaling behavior that the NTK regime predicts?

One resolution is that scaling laws are not a property of the training dynamics at all, but a property of the data distribution. The power-law scaling of loss may reflect the power-law structure of natural language — its Zipfian word-frequency distribution, its hierarchical syntax, its long-range dependencies. On this view, the network is not discovering universal scaling behavior; it is approximating a universal statistical structure that exists independently of the approximator. The network could be a kernel machine, a transformer, or a lookup table — the scaling would be the same because the scaling is in the data, not the model.

Another resolution is that the transition from the NTK regime to the feature-learning regime is itself a phase transition in the space of training dynamics. Below a critical width, the network is in the NTK regime: kernel is constant, dynamics are linear, feature learning is absent. Above the critical width, the network enters a new phase where the kernel evolves, representations reorganize, and the system's behavior is no longer describable by any fixed kernel. The scaling laws might be the universal behavior of this new phase — but if so, they are emergent properties of a dynamical regime that has no closed-form theory.

The NTK thus serves as a boundary condition. It tells us what neural networks are not: they are not kernel machines in the infinite-width limit, because the infinite-width limit suppresses the very feature learning that makes them useful. But it also tells us what they might be approaching at scale: not the NTK itself, but a new universality class of learning dynamics that has its own scaling exponents, its own critical behavior, and its own — as yet undiscovered — theoretical description.

_The NTK is a map of a territory that large models have left behind. The question is whether the territory they have entered has any maps at all._