Contrastive divergence

Contrastive divergence (CD) is an approximate learning algorithm for restricted Boltzmann machines (RBMs) and related energy-based models, introduced by Geoffrey Hinton in 2002. It provides a fast alternative to maximum likelihood estimation, which requires computing the model's partition function — an intractable sum over all possible configurations. CD sidesteps this intractability by running Gibbs sampling for only a few steps (typically one) rather than waiting for the chain to converge to equilibrium.

The core idea is deceptively simple: instead of comparing the data distribution to the model's equilibrium distribution, CD compares the data to a distribution obtained by taking a few Gibbs steps from the data itself. The difference between these two distributions approximates the gradient of the log-likelihood. The algorithm is not exact — the approximation introduces bias — but it is computationally feasible, and in practice it produces useful gradients for training deep generative models.

The Algorithm

For an RBM with visible units v and hidden units h, the maximum likelihood gradient of the log-likelihood with respect to a weight w_{ij} is:

Δw_{ij} = ⟨v_i h_j⟩_{data} - ⟨v_i h_j⟩_{model}

where the first term is the expectation under the data distribution and the second term is the expectation under the model's equilibrium distribution. The second term is intractable because it requires summing over all possible visible configurations.

Contrastive divergence approximates the second term by replacing the model equilibrium with a short Gibbs chain starting from the data. In CD-1 (the most common variant), the algorithm: 1. Samples hidden units from the data: h ~ P(h|v_data) 2. Reconstructs visible units: v' ~ P(v|h) 3. Samples hidden units again: h' ~ P(h|v')

The weight update becomes: Δw_{ij} ≈ ⟨v_i h_j⟩_{data} - ⟨v'_i h'_j⟩_{reconstruction}

This is the difference between the data statistics and the reconstruction statistics. The algorithm pushes the model to make reconstructions that resemble the data, which is equivalent to increasing the probability of the data under the model.

Approximation, Bias, and Persistent Contrastive Divergence

CD is not unbiased. The short Gibbs chain does not reach equilibrium, so the reconstruction statistics differ from the true model statistics. The bias is small when the data distribution is close to the model distribution — which is true early in training — but can become significant later. Despite this, CD works remarkably well in practice.

Several variants have been developed to reduce the bias. Persistent Contrastive Divergence (PCD), introduced by Tieleman in 2008, maintains a persistent Markov chain that is not reset to the data at each gradient step. The chain runs for many steps across training iterations, gradually approaching equilibrium. PCD provides more accurate gradients but requires careful tuning of the learning rate to maintain chain stability.

Other variants include parallel tempering, which runs multiple chains at different temperatures to improve mixing, and mean-field contrastive divergence, which replaces sampling with mean-field approximations. Each variant trades off computational cost against approximation accuracy.

Connections to Broader Frameworks

Contrastive divergence can be understood as a special case of several broader principles. From the perspective of Markov Chain Monte Carlo, it is a truncated MCMC algorithm that sacrifices asymptotic exactness for computational speed. From the perspective of the Free Energy Principle, it minimizes an approximation to the variational free energy: the reconstruction error is a bound on the surprise of the data, and the algorithm updates parameters to reduce this bound.

The connection to the Wake-Sleep Algorithm is also illuminating. Both are approximate learning algorithms for generative models, but whereas wake-sleep uses separate recognition and generative networks trained with different objectives, CD trains a single symmetric network. The RBM's bipartite structure makes this possible: the same weights that map visible to hidden also map hidden to visible, so the forward and backward passes use the same parameters.

The Legacy of Contrastive Divergence

Contrastive divergence enabled the 2006 deep learning revival by making deep belief networks trainable. Without CD, the RBM layer-wise pre-training that launched the deep learning revolution would have been computationally infeasible. The algorithm's success demonstrated that approximate inference — inference that is formally biased but practically effective — can be more useful than exact methods that are too slow to apply.

The broader lesson is that the gradient of the log-likelihood is not the only useful training signal. Any update rule that pushes the model toward better data reconstructions, even if it is not a true gradient, can produce useful representations. This insight underlies modern self-supervised learning: contrastive learning in vision, masked language modeling in NLP, and diffusion models in generative AI all use approximate objectives that are not exact likelihoods but are effective at learning structure.

The success of contrastive divergence is often attributed to the fact that it is 'good enough' for training RBMs. This misses the deeper point. CD is not a flawed approximation to a correct algorithm; it is a correct algorithm for a different problem. The problem is not 'find the exact maximum likelihood parameters' but 'find parameters that make the model's reconstructions resemble the data after one step of Gibbs sampling.' This is a well-defined optimization problem with its own geometry, and its solution happens to produce useful representations. The field's obsession with exact likelihoods and unbiased gradients has blinded it to a larger truth: approximate objectives can have better optimization landscapes than exact ones. Contrastive divergence worked not despite its approximation but because of it — because the approximate objective had fewer spurious local minima and better generalization properties than the true likelihood. This is the same principle that explains why stochastic gradient descent outperforms batch gradient descent, why dropout improves generalization, and why self-supervised learning rivals supervised learning. The noise is not the bug. It is the feature.