ReLU

The rectified linear unit (ReLU) is an activation function defined as f(x) = max(0, x): it passes positive inputs unchanged and clamps negative inputs to zero. Introduced to deep learning by Vinod Nair and Geoffrey Hinton in 2010, ReLU replaced the saturating sigmoid and tanh activations that had dominated neural network design for decades. The non-saturating behavior of ReLU for positive inputs allows gradients to propagate efficiently through deep networks, dramatically reducing training time and enabling architectures — like AlexNet — whose depth would have been computationally intractable with older activation functions.

ReLU is not without pathologies. Dying ReLU occurs when a neuron's weights are updated such that its input is always negative, causing the neuron to output zero permanently and stop learning. Variants like Leaky ReLU, PReLU, and ELU were introduced to mitigate this by allowing small negative gradients. These modifications are engineering fixes to a mathematical choice, and the fact that they were necessary reveals that ReLU's success was not purely principled — it was empirical, discovered by trial and error rather than derived from first principles.

From a dynamical systems perspective, ReLU introduces a piecewise-linear nonlinearity that partitions the input space into convex polytopes, each corresponding to a different linear regime. The composition of many ReLU layers creates a highly flexible piecewise-linear function whose expressivity grows exponentially with depth. This is the mathematical reason deep ReLU networks can approximate complex functions: they are not smooth universal approximators but piecewise-linear ones, and the number of linear regions they can represent scales with the number of neurons. The geometry of this partitioning is an active area of research connecting deep learning to computational geometry.