Dropout
Dropout is a regularization technique for neural networks in which randomly selected neurons are temporarily removed from the network during training, along with all their incoming and outgoing connections. Introduced by Geoffrey Hinton and colleagues in 2012, dropout prevents neurons from co-adapting by forcing the network to learn robust features that do not depend on the presence of any single neuron. At test time, all neurons are used but their outputs are scaled by the probability that they were retained during training, approximating the effect of averaging over the exponentially many thinned networks produced during training.
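A minimal NumPy sketch of this training/test asymmetry is below; the function name, drop probability, and toy activations are illustrative rather than drawn from any particular library.

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, training=True, rng=None):
    """Standard (non-inverted) dropout as described above.

    During training, each activation is zeroed independently with
    probability p_drop. At test time all units are kept, but outputs
    are scaled by the retention probability (1 - p_drop) so that the
    expected activation matches the training-time average.
    """
    rng = rng or np.random.default_rng()
    if training:
        mask = rng.random(x.shape) >= p_drop   # keep each unit with prob. 1 - p_drop
        return x * mask
    return x * (1.0 - p_drop)                  # test-time scaling by retention probability

# Illustrative usage on a toy activation vector
h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_forward(h, p_drop=0.5, training=True))   # some entries zeroed at random
print(dropout_forward(h, p_drop=0.5, training=False))  # all entries scaled by 0.5
```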
Dropout is best understood as a crude form of ensemble learning compressed into a single model. Rather than training multiple networks and averaging their predictions — which is computationally expensive — dropout trains one network to behave like the average of many networks. The randomness is not noise to be eliminated but a structural feature that shapes what the network learns. The technique exemplifies a broader pattern in machine learning where stochasticity is used as a design tool rather than treated as an obstacle.
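The averaging claim can be checked directly in the simplest possible case. The sketch below, with made-up weights, input, and drop probability, compares a Monte Carlo average over many sampled thinned networks against a single scaled forward pass for one linear unit.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # illustrative weights of a single linear unit
x = rng.normal(size=8)          # illustrative input
p_drop = 0.5

# Monte Carlo average over many "thinned" networks: a fresh Bernoulli mask each pass,
# all passes sharing the same weights w.
samples = [(x * (rng.random(8) >= p_drop)) @ w for _ in range(100_000)]
mc_average = np.mean(samples)

# Single deterministic pass with test-time scaling by the retention probability.
scaled_pass = (x * (1.0 - p_drop)) @ w

print(mc_average, scaled_pass)  # the two agree closely for this linear unit
```

For this linear unit the scaled pass matches the ensemble average in expectation; the next paragraph concerns where that equivalence stops being exact.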
Despite its empirical success, dropout's theoretical foundations remain incomplete. The standard justification — that it approximates Bayesian model averaging over a Bernoulli distribution of subnetworks — is elegant but not rigorous. The approximation breaks down for deep networks with nonlinear interactions, and the scaling heuristic used at test time is a post-hoc correction rather than a derived result. Dropout works better than the theory says it should, a familiar situation in deep learning where engineering intuition outpaces mathematical understanding.
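The gap can be stated compactly. For a single unit with retention probability $q$ (so the dropout probability is $1-q$), the expectation over Bernoulli masks commutes with a linear map but not, in general, with a nonlinearity; the notation below is introduced only for this illustration.

```latex
\mathbb{E}_{m_i \sim \mathrm{Bernoulli}(q)}\!\left[\sum_i m_i w_i x_i\right]
  = q \sum_i w_i x_i
  \quad \text{(linear unit: test-time scaling is exact)}

\mathbb{E}_{m}\!\left[f\!\Big(\sum_i m_i w_i x_i\Big)\right]
  \neq f\!\Big(q \sum_i w_i x_i\Big)
  \quad \text{in general (nonlinear unit: scaling is a heuristic)}
```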