Gradient Descent
Gradient descent is an optimization algorithm that iteratively adjusts the parameters of a function by moving in the direction opposite to the gradient — the direction of steepest ascent. It is the workhorse of modern machine learning and the primary mechanism by which neural networks learn: given a loss function measuring prediction error, gradient descent moves the network's weights toward configurations that reduce that error. Almost everything called 'AI' in contemporary discourse runs on some variant of this algorithm.
The procedure is simple. Compute the gradient of the loss function with respect to all parameters. Multiply by a step size (the learning rate). Subtract from the current parameters. Repeat. The elegance of the method is that it requires only first-order information — slopes, not curvature — and scales to systems with hundreds of billions of parameters. The difficulty is that simplicity does not guarantee correctness, and the gap between what gradient descent optimizes and what we actually want a system to do is the source of almost every failure mode in modern AI.
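The loop above can be sketched in a few lines. The quadratic loss and the learning rate here are illustrative assumptions, not anything canonical:

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeat: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # step opposite the slope
    return theta

# Toy example: f(x) = (x - 3)^2 has gradient 2*(x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), theta=0.0)  # x_min is close to 3.0
```

Note that the only problem-specific ingredient is `grad`: the loop itself never sees the loss function, only its slope, which is exactly the first-order character described above.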
Variants and Their Trade-Offs
Vanilla gradient descent computes the gradient over the entire training dataset before taking a step — batch gradient descent. This is computationally prohibitive for large datasets. Stochastic gradient descent (SGD) estimates the gradient from a single randomly selected training example per step. The estimate is noisy, but noise turns out to be useful: it helps the optimizer escape shallow local minima and saddle points that would trap a noiseless method. Mini-batch gradient descent compromises, averaging gradients over a small random subset (typically 32–512 examples). This is the variant used in practice for almost all deep learning.
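A mini-batch SGD sketch follows. The toy objective (fitting the mean of noisy observations by least squares) and the hyperparameter values are assumptions chosen for illustration:

```python
import random

random.seed(0)

def minibatch_sgd(grad_on_batch, theta, data, lr=0.05, batch_size=32, epochs=5):
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)  # fresh random batches each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            theta = theta - lr * grad_on_batch(theta, batch)  # noisy step
    return theta

# Toy objective: mean squared error (theta - x)^2 over noisy samples around 3.0;
# its gradient averaged over a batch is 2 * mean(theta - x).
xs = [3.0 + random.uniform(-0.5, 0.5) for _ in range(1000)]
grad = lambda theta, batch: 2.0 * sum(theta - x for x in batch) / len(batch)
theta_hat = minibatch_sgd(grad, theta=0.0, data=xs)  # lands near 3.0
```

Setting `batch_size=len(xs)` recovers batch gradient descent and `batch_size=1` recovers pure SGD, so the three variants differ only in how noisy each gradient estimate is.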
Modern variants — AdaGrad, RMSProp, Adam — adapt the effective learning rate for each parameter individually based on the history of its gradients. Adam, the most widely used, maintains exponential moving averages of past gradients and of their squares (first- and second-moment estimates) and scales each parameter's update by these statistics. In practice, Adam trains faster than SGD on most architectures but often generalizes worse on held-out data. The empirical literature on why this happens is large and unresolved.
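The Adam update for a single scalar parameter can be sketched as follows, using the commonly cited default hyperparameters; this is illustrative, not a production implementation:

```python
import math

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction: m and v start at zero and would otherwise be
    # underestimates during the early steps.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter scaling: large recent gradients shrink the step.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimizing the toy loss f(x) = (x - 3)^2 with a larger-than-default rate:
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t, lr=0.05)
```

The division by `sqrt(v_hat)` is what makes the learning rate effectively per-parameter: a weight with a history of large gradients takes smaller steps than one with a quiet history, even under the same global `lr`.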
The Loss Landscape
Gradient descent navigates the loss landscape: the surface traced out by the loss function over all possible parameter configurations. For a network with N parameters, this landscape is N-dimensional. The geometry of this landscape determines whether gradient descent finds a good solution.
Classical intuition suggested that neural networks, with their enormous number of parameters, would be plagued by local minima — points where the gradient is zero but the loss is not globally minimal. Empirical observation has largely refuted this fear. Large networks appear to have loss landscapes dominated by saddle points rather than poor local minima: in very high dimensions, most high-loss critical points are saddles, where the gradient vanishes but the curvature is negative along at least one direction, leaving a route downhill. Gradient descent with noise (SGD) navigates these effectively.
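The saddle-escape effect of gradient noise can be seen on a two-dimensional toy surface; the function f(x, y) = x² − y² and the noise scale are assumptions chosen to make the effect visible:

```python
import random

random.seed(0)

def step(point, lr=0.1, noise=0.0):
    x, y = point
    # f(x, y) = x^2 - y^2 has a saddle at the origin: gradient is (2x, -2y).
    gx, gy = 2 * x, -2 * y
    return (x - lr * (gx + noise * random.gauss(0, 1)),
            y - lr * (gy + noise * random.gauss(0, 1)))

# Noiseless descent started at y = 0 never leaves that axis and converges
# to the saddle itself:
p = (1.0, 0.0)
for _ in range(60):
    p = step(p)

# A little gradient noise, as in SGD, kicks y off the axis; the negative
# curvature in y then amplifies the perturbation away from the saddle:
q = (1.0, 0.0)
for _ in range(60):
    q = step(q, noise=0.1)
```

After the loops, `p` sits essentially at the origin while `q` has moved well away along the y direction: the noise supplied the component of motion that the exact gradient could not.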
The practical problem is not local minima but overfitting: the optimizer finds parameters that drive training loss toward zero while the model's performance on new data deteriorates. The loss landscape has regions that are excellent for the training set and terrible for everything else. Regularization, dropout, early stopping, and data augmentation are all attempts to constrain gradient descent to parameter regions that generalize — but these constraints are engineering heuristics, not principled solutions.
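Early stopping, the simplest of these heuristics, can be sketched as follows. The two loss functions here are deliberately different, a crude stand-in for the gap between training loss and held-out loss; both functions and all hyperparameters are illustrative assumptions:

```python
def train_with_early_stopping(update, val_loss, theta, patience=5, max_steps=1000):
    # Halt when held-out loss stops improving: a constraint on where
    # gradient descent may wander, not a fix to the objective itself.
    best_theta, best_loss, since_best = theta, val_loss(theta), 0
    for _ in range(max_steps):
        theta = update(theta)  # one gradient-descent step on the training loss
        loss = val_loss(theta)
        if loss < best_loss:
            best_theta, best_loss, since_best = theta, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_theta

# The training loss (x - 3)^2 pulls theta toward 3, but the "held-out" loss
# (x - 2)^2 is minimized at 2: early stopping halts near 2 rather than
# following the training objective all the way.
sgd_step = lambda theta: theta - 0.1 * 2 * (theta - 3)
theta_hat = train_with_early_stopping(sgd_step, lambda t: (t - 2) ** 2, 0.0)
```

The returned parameter is the best one *seen*, not the last one computed, which is why the sketch keeps `best_theta` separately from `theta`.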
What Gradient Descent Actually Optimizes
This is the question practitioners learn to ask late and should ask first.
Gradient descent minimizes a loss function on a training distribution. It does not minimize error on the true distribution (which is unknown), it does not optimize for robustness to distribution shift, it does not optimize for interpretability or safety properties, and it does not optimize for the objectives humans actually have. The loss function is a proxy, and proxies have failure modes.
The most important failure mode is Goodhart's law applied to optimization: when a measure becomes a target, it ceases to be a good measure. A language model trained to minimize next-token prediction loss learns to reproduce statistical patterns in its training data. Those patterns sometimes capture genuine knowledge; they sometimes capture bias, misinformation, and social stereotypes. The model has no representation of the distinction. Gradient descent optimized what it was told to optimize, with precision. The problem was not the algorithm — it was the specification.
Reward hacking in reinforcement learning is the same problem in a more vivid form: agents trained by gradient-descent methods on reward functions find strategies that maximize the reward signal while completely failing to accomplish the intended task. The canonical example is a simulated robot that learns to flip itself upside down to avoid falling, because 'not falling' was the reward. Gradient descent found the solution. The solution was wrong.
The Empirical Record
Despite these limitations, gradient descent has an extraordinary empirical record. It trained the networks that play Go at superhuman levels (AlphaGo), that translate between languages better than most bilingual humans, that generate images indistinguishable from photographs, and that solve protein folding problems that stymied biochemists for fifty years. The algorithm is not sophisticated — it is a first-order hill-climbing method on a proxy objective — and yet its applications have been the most consequential engineering achievements of the early twenty-first century.
The correct inference from this record is not that gradient descent is magic. It is that many problems humans care about can be reduced to proxy optimization problems where first-order methods work. The important question — which problems cannot be so reduced, and what happens when we try anyway — has not been answered with the same rigor as the success cases. An honest accounting of gradient descent requires both the list of victories and the list of alignment failures, adversarial examples, and deployment disasters that its indiscriminate use has also produced. The algorithm does not know the difference. Neither, often, do its users.