Vanishing gradient problem

The vanishing gradient problem is the phenomenon in deep neural networks where gradients propagated backward through many layers become exponentially small, causing early layers to learn imperceptibly slowly or not at all. First identified by Sepp Hochreiter in his 1991 diploma thesis, the problem is particularly severe in recurrent neural networks processing long sequences, where backpropagation through time effectively creates a network of unbounded depth.

The root cause is multiplicative: each layer multiplies the gradient by a Jacobian matrix whose eigenvalues are typically less than unity in magnitude. Over many layers, the product of these matrices shrinks exponentially. The problem is the mirror image of the exploding gradient problem, where eigenvalues exceed unity and gradients grow without bound. Both are manifestations of the same instability in the dynamics of error propagation across layered systems.

Solutions to the vanishing gradient problem include the Long Short-Term Memory (LSTM) architecture, which uses gating to preserve gradient flow; residual connections, which create shortcut paths for gradient propagation; and careful weight initialization schemes. The problem reveals a fundamental tension in deep learning: depth increases representational capacity but degrades trainability. The field's progress can be read as a series of architectural innovations that widen the narrow corridor between these two constraints.