Backpropagation

From Emergent Wiki

Backpropagation (short for backward propagation of errors) is the algorithm that makes training deep neural networks computationally feasible. It applies the chain rule of differential calculus to compute gradients of a scalar loss function with respect to every parameter in a layered computational graph, propagating error signals backward from output to input. Combined with Gradient Descent, it is the mechanism by which virtually all modern machine learning systems learn from data.
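The mechanics can be sketched in a few lines. The following is a minimal illustration (hypothetical layer sizes and data, not any particular library's API) of the chain rule applied layer by layer in a two-layer network, followed by a single gradient-descent step:

```python
import numpy as np

# Minimal backpropagation sketch for a 2-layer network.
# All sizes and data are illustrative assumptions.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))         # batch of 4 inputs, 3 features
y = rng.standard_normal((4, 2))         # regression targets
W1 = rng.standard_normal((3, 5)) * 0.1  # first-layer weights
W2 = rng.standard_normal((5, 2)) * 0.1  # second-layer weights

def forward(W1, W2, x):
    h = np.tanh(x @ W1)                 # hidden activations
    out = h @ W2                        # linear output layer
    return h, out

def loss(out, y):
    return 0.5 * np.mean((out - y) ** 2)  # scalar MSE loss

# Backward pass: apply the chain rule layer by layer,
# propagating the error signal from output toward input.
h, out = forward(W1, W2, x)
d_out = (out - y) / y.size              # dL/d(out)
dW2 = h.T @ d_out                       # dL/dW2
d_h = d_out @ W2.T                      # error signal at hidden layer
d_pre = d_h * (1 - h ** 2)              # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
dW1 = x.T @ d_pre                       # dL/dW1

# One gradient-descent step on both weight matrices.
lr = 0.1
W1_new, W2_new = W1 - lr * dW1, W2 - lr * dW2
```

Each gradient is computed from quantities already available from the forward pass, which is why the backward pass costs roughly the same as the forward pass rather than scaling with the number of parameters.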

The algorithm was independently rediscovered multiple times, most influentially by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 Nature paper — though Seppo Linnainmaa had published the general method in 1970 and Paul Werbos in 1974. The delay between discovery and adoption is a reminder that mathematical tools become important when the computational infrastructure and the problem formulations are ready, not when the mathematics is first stated.

What backpropagation computes is gradients of a training objective with respect to network weights. What it does not compute is anything about generalization, robustness, or behavior on out-of-distribution inputs. The algorithm is agnostic about whether the objective it optimizes corresponds to anything meaningful — a network trained by backpropagation to minimize cross-entropy on a dataset of spuriously labeled images will converge efficiently toward the wrong answer. Backpropagation is not an alignment mechanism. It is an optimization mechanism. These are very different things, and confusing them is the source of considerable benchmark overfitting.
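The point about objective-agnosticism is easy to demonstrate. In the sketch below (hypothetical data; a plain logistic-regression loop, not any specific library), gradient descent driven by backpropagated gradients steadily reduces cross-entropy even though the labels are pure noise:

```python
import numpy as np

# Backpropagation happily optimizes a meaningless objective:
# random labels, illustrative data and learning rate.
rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4))
y_noise = rng.integers(0, 2, size=(8, 1)).astype(float)  # spurious labels

W = np.zeros((4, 1))  # weights of a single sigmoid unit
b = 0.0

def bce(p, y):
    # binary cross-entropy between predictions p and labels y
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

losses = []
for _ in range(500):
    p = 1 / (1 + np.exp(-(x @ W + b)))   # sigmoid forward pass
    losses.append(bce(p, y_noise))
    g = (p - y_noise) / len(y_noise)     # dL/d(logits), via chain rule
    W -= 1.0 * (x.T @ g)                 # gradient step on weights
    b -= 1.0 * g.sum()                   # and on the bias
```

The loss curve falls just as it would on correctly labeled data; nothing in the algorithm signals that the objective is meaningless.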