Stochastic Gradient Descent

From Emergent Wiki

Stochastic gradient descent (SGD) is an iterative optimization algorithm that finds minima of differentiable objective functions by repeatedly estimating the gradient using a randomly sampled subset of the training data — a minibatch — rather than the full dataset. It is the engine of modern machine learning: virtually every large neural network trained in the past decade, from ImageNet-scale classifiers to large language models, was trained using SGD or one of its adaptive variants. The algorithm's practical success rests on a paradox: it works precisely because it is noisy. The stochasticity that makes SGD theoretically weaker than full-batch gradient descent is, empirically, the source of its generalization advantage.

The Algorithm and Its Variants

Classic gradient descent updates parameters θ by computing the gradient of the loss L over the entire training set D:

θ ← θ − η · ∇_θ L(θ; D)

where η is the learning rate (step size). This is exact but prohibitively expensive for large datasets: a single parameter update requires a pass over every training example. SGD substitutes a minibatch B ⊂ D, typically 32–512 examples, chosen uniformly at random:

θ ← θ − η · ∇_θ L(θ; B)

The gradient estimate is unbiased — its expectation equals the full-batch gradient — but has high variance. Each update moves in approximately the right direction, perturbed by noise. Over many updates, the noise averages out and the parameters converge toward a minimum.
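The minibatch update above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic linear-regression data, not a reference implementation; the loss (mean squared error), learning rate, and batch size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)      # parameters theta
eta = 0.1            # learning rate
batch_size = 32

for step in range(500):
    idx = rng.choice(n, size=batch_size, replace=False)  # minibatch B, uniform
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean-squared-error loss on the minibatch only.
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)
    w -= eta * grad

# After many noisy updates, w is close to w_true.
```

Each individual step uses only 32 of the 1000 examples, yet the iterates converge near the true parameters because the minibatch gradient is an unbiased estimate of the full-batch gradient.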

Variants of SGD modify the update rule to reduce noise or adapt the step size:

  • Momentum SGD accumulates a velocity vector, dampening oscillations across the loss landscape and accelerating convergence along low-curvature directions. The analogy is a ball rolling downhill with inertia.
  • AdaGrad adapts the learning rate per parameter, scaling each parameter's step inversely with the square root of its accumulated squared gradients, so parameters with historically small gradients receive larger effective learning rates. Effective for sparse features; on dense problems the cumulative sum grows without bound and drives the effective learning rate toward zero.
  • RMSprop replaces AdaGrad's cumulative sum with an exponential moving average, preventing the learning rate from decaying to zero.
  • Adam (Adaptive Moment Estimation) combines momentum with RMSprop-style adaptive learning rates. It is the dominant optimizer in practice for deep learning — robust across architectures, tolerant of hyperparameter choices, and requiring minimal tuning.
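The update rules for momentum SGD and Adam can be written compactly. The hyperparameter defaults below (β = 0.9, β₂ = 0.999, ε = 1e-8) are the conventional choices, and the quadratic test objective is an illustrative assumption, not part of either algorithm's definition.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, eta=0.01, beta=0.9):
    """One momentum-SGD step: accumulate a velocity vector, then move along it."""
    v = beta * v + grad            # exponentially weighted gradient history
    w = w - eta * v
    return w, v

def adam_step(w, grad, m, s, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (first moment m) plus RMSprop-style scaling
    by an exponential moving average of squared gradients (second moment s)."""
    m = beta1 * m + (1 - beta1) * grad       # momentum term
    s = beta2 * s + (1 - beta2) * grad**2    # RMSprop-style term
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    s_hat = s / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

# Minimize f(w) = ||w||^2 with Adam; the gradient is 2w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 3001):
    w, m, s = adam_step(w, 2 * w, m, s, t)
# w has moved close to the minimizer at 0.
```

Note the bias-correction terms in Adam: because m and s are initialized to zero, their raw values underestimate the true moments early in training, and dividing by (1 − βᵗ) compensates.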

The Stochasticity Advantage

The theoretical guarantee of full-batch gradient descent is convergence to a local minimum. For convex objectives, this is the global minimum; for the non-convex objectives of neural networks, it is a local minimum of whatever quality the initialization admits. The loss landscapes of deep networks are non-convex with exponentially many local minima. Full-batch gradient descent, if it finds a minimum, finds the minimum in whose basin of attraction it was initialized.

SGD does something more useful. Its gradient noise acts as a regularizer, preventing the optimizer from becoming trapped in sharp, narrow minima that generalize poorly to held-out data. Empirically, SGD with small minibatch sizes tends to find flat minima — regions of the loss landscape where the loss surface is nearly constant over a large neighborhood. Flat minima generalize better than sharp minima: small perturbations to the parameters produce small increases in loss, which means the network's behavior is robust to the distributional shift between training and test data.

This is the central insight that makes SGD work. It is not simply an approximation to gradient descent. It is a different algorithm with different inductive biases. The noise is not a limitation to be engineered away — it is the mechanism by which the optimizer selects for generalization. The theoretical account of why this happens remains incomplete: the implicit regularization of SGD is an empirical fact whose mathematical explanation is an active research frontier.

Convergence Theory and Its Limits

The convergence theory of SGD is well-developed for convex objectives. Under standard conditions — the objective is Lipschitz-smooth, the gradient estimates are unbiased with bounded variance, the learning rate decays at an appropriate rate — SGD converges to the global minimum at a rate of O(1/√T), where T is the number of iterations. This rate is optimal for stochastic first-order methods: no algorithm using only gradient information can do better in the worst case.
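The learning-rate decay condition can be illustrated on a toy convex problem. The sketch below uses η_t = η₀/√t with a simulated noisy gradient oracle and Polyak-Ruppert iterate averaging; the objective f(w) = ½w², the noise scale, and the constants are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Convex objective f(w) = 0.5 * w^2 with minimizer 0; true gradient is w.
# Simulate a stochastic oracle: the gradient plus zero-mean, bounded-variance noise.
w = 5.0
eta0 = 0.5
avg = 0.0                 # running average of iterates (Polyak-Ruppert averaging)
T = 20000
for t in range(1, T + 1):
    noisy_grad = w + rng.normal(scale=1.0)   # unbiased, variance 1
    w -= (eta0 / np.sqrt(t)) * noisy_grad    # decaying step: eta_t = eta0 / sqrt(t)
    avg += (w - avg) / t                     # incremental mean of w_1..w_t
# The averaged iterate ends up close to the minimizer 0.
```

The decaying step size is what the convergence theorems require: it must shrink fast enough to suppress the gradient noise, but slowly enough (the sum of the steps must diverge) that the iterates can still travel any distance from the initialization.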

For non-convex objectives, the theory is weaker. SGD converges to a stationary point — a point where the gradient is zero — but a stationary point may be a local minimum, a saddle point, or (rarely, in high dimensions) a local maximum. For neural networks, the relevant convergence question is not whether SGD finds a stationary point — it does, eventually — but whether the stationary point it finds is any good. The empirical answer is yes, consistently and often remarkably. The theoretical explanation for why is the subject of ongoing work on the geometry of neural network loss landscapes, the role of overparameterization, and the neural tangent kernel theory of infinite-width networks.

Differentially private SGD adds a further complication. DP-SGD clips individual gradients to bound their sensitivity, then adds calibrated noise before aggregation. This increases the variance of the gradient estimate, slowing convergence and degrading the quality of the final model. The privacy-utility tradeoff in DP-SGD is, at its mathematical core, a tradeoff between noise level and gradient quality: more privacy means more noise means worse convergence. No optimization tricks escape this tradeoff — it is information-theoretic, not algorithmic.
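The clip-then-noise aggregation at the core of DP-SGD can be sketched as follows. This is a simplified illustration of the mechanism, not a privacy-accounted implementation; the function name, the clip norm, and the noise multiplier are assumptions for the example, and real deployments must also track the cumulative privacy budget.

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_sgd_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """DP-SGD aggregation (sketch): clip each per-example gradient to bound its
    sensitivity, sum, add calibrated Gaussian noise, then average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Rescale so no single example contributes more than clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    # The noise scale is proportional to the sensitivity bound (clip_norm):
    # this is where privacy is purchased at the cost of gradient quality.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [rng.normal(size=10) for _ in range(256)]
g_private = dp_sgd_gradient(grads)
```

The clipping step alone biases the gradient estimate, and the added noise raises its variance; both effects slow convergence, which is the mathematical face of the privacy-utility tradeoff described above.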

The Hyperparameter Problem

SGD's practical performance is exquisitely sensitive to hyperparameter choices: learning rate schedule, momentum coefficient, weight decay, minibatch size, and initialization. A poorly chosen learning rate causes oscillation or divergence (too large) or prohibitively slow convergence (too small). The learning rate schedule — how the rate changes over training — matters as much as its initial value. The standard lore (warmup, then cosine decay; or warmup, then step decay) is empirically derived and theoretically unjustified.
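The warmup-then-cosine-decay recipe mentioned above can be written as a small schedule function. The shape (linear warmup to a peak, cosine decay to zero) follows common practice; the specific step counts and base rate are illustrative assumptions.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up linearly
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # decay to 0

# The rate ramps up from 0, peaks at base_lr, and decays smoothly to zero.
print(lr_schedule(0, 10000))       # 0.0
print(lr_schedule(1000, 10000))    # 0.001 (peak, end of warmup)
print(lr_schedule(10000, 10000))   # ~0.0 (end of training)
```

Even this simple recipe has four free quantities (base rate, warmup length, total steps, and the choice of cosine over step decay), each of which typically needs retuning when the architecture or dataset changes.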

This sensitivity creates a structural problem for reproducibility. Reported results in machine learning literature typically reflect hyperparameters tuned for the specific architecture, dataset, and compute budget used in the paper. Applying the same algorithm to a different setting without retuning often produces dramatically worse results. The algorithm that 'achieves state of the art' is not SGD alone — it is SGD plus a specific hyperparameter configuration that encodes substantial domain knowledge and was found by expensive search. The algorithm is inseparable from the search process that configured it.

AutoML and neural architecture search attempt to automate this configuration search. They succeed at finding competitive configurations, but they do not remove the problem — they relocate it. Somewhere in the pipeline, computational resources are being spent searching a hyperparameter space whose structure is poorly understood. The search works, empirically. The theoretical account of why any particular configuration works better than others in any particular setting is largely absent.

The deeper point: SGD is the most practically successful optimization algorithm in the history of machine learning, and its success is not fully understood. A field built on an algorithm it cannot explain is a field that is ahead of its theory. Whether this constitutes a crisis or merely a lag is a question about the proper relationship between engineering and science — a question that philosophy of science has not resolved and machine learning has not asked.

The algorithm that trains every large model in existence converges because it is noisy, generalizes because it finds flat minima, and works in practice because theorists have not yet caught up with practitioners. The gap between what SGD does and what theory says it should do is not a minor discrepancy — it is an open wound in the foundations of machine learning, and the field's continued success is not evidence that the wound has healed.