Persistent Contrastive Divergence

From Emergent Wiki

Persistent Contrastive Divergence (PCD), also known as stochastic maximum likelihood, is a variant of contrastive divergence introduced by Tieleman in 2008 to reduce the bias of the standard CD-1 approximation. Where standard CD restarts the Gibbs chain at a data point on every gradient step, PCD maintains a persistent Markov chain that carries over from one training iteration to the next. The chain is initialized randomly and advanced by a small number of Gibbs steps after each weight update, so that it gradually approaches, and then tracks, the model's equilibrium distribution.
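The procedure described above can be sketched for a binary restricted Boltzmann machine as follows. This is a minimal illustration rather than a reference implementation: the function name `train_pcd`, the use of several parallel chains, and all hyperparameter defaults are assumptions made for the example, not details taken from the original algorithm description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, c, rng):
    """Return P(h=1|v) and a binary sample of the hidden units."""
    p = sigmoid(v @ W + c)
    return p, (rng.random(p.shape) < p).astype(np.float64)

def sample_v(h, W, b, rng):
    """Return P(v=1|h) and a binary sample of the visible units."""
    p = sigmoid(h @ W.T + b)
    return p, (rng.random(p.shape) < p).astype(np.float64)

def train_pcd(data, n_hidden=8, n_chains=16, k=1, lr=0.01,
              epochs=20, batch_size=16, seed=0):
    """Hypothetical PCD trainer for a binary RBM (illustrative defaults)."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # visible biases
    c = np.zeros(n_hidden)    # hidden biases

    # Persistent chains: initialized randomly, never reset to the data.
    chains = (rng.random((n_chains, n_visible)) < 0.5).astype(np.float64)

    for epoch in range(epochs):
        perm = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            v_pos = data[perm[start:start + batch_size]]

            # Positive phase: hidden probabilities given the data.
            ph_pos, _ = sample_h(v_pos, W, c, rng)

            # Negative phase: advance the persistent chains by k Gibbs
            # steps under the current weights (plain CD would instead
            # restart the chains at v_pos here).
            for step in range(k):
                _, h = sample_h(chains, W, c, rng)
                _, chains = sample_v(h, W, b, rng)
            ph_neg, _ = sample_h(chains, W, c, rng)

            # Stochastic gradient ascent on the approximate log-likelihood.
            W += lr * (v_pos.T @ ph_pos / len(v_pos)
                       - chains.T @ ph_neg / n_chains)
            b += lr * (v_pos.mean(axis=0) - chains.mean(axis=0))
            c += lr * (ph_pos.mean(axis=0) - ph_neg.mean(axis=0))
    return W, b, c, chains
```

The only structural difference from CD-k in this sketch is that `chains` lives outside the training loop and is carried from one update to the next.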

The key insight is that small weight updates cause only small changes to the equilibrium distribution, so a chain that was near equilibrium before the update remains near equilibrium afterward. This gives the algorithm the benefit of much longer effective sampling runs without the computational cost of running MCMC to convergence at every step. The trade-off is sensitivity: PCD requires smaller learning rates so that the parameters do not change faster than the chain can mix, and the persistent chain can fall out of step with the model, for example by getting stuck in one mode, if the model distribution becomes strongly multimodal during training.
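The argument can be written out for a binary RBM. The notation here is assumed for illustration and is not taken from the article: energy $E(v,h) = -b^\top v - c^\top h - v^\top W h$, parameters $\theta = (W, b, c)$, learning rate $\eta$, and $M$ persistent chain states $(\tilde v^{(m)}, \tilde h^{(m)})$.

```latex
% Exact log-likelihood gradient for one data vector v:
% a data-dependent term minus an expectation under the model.
\begin{align}
\frac{\partial \log p_\theta(v)}{\partial W_{ij}}
  &= \mathbb{E}_{p_\theta(h \mid v)}\!\left[ v_i h_j \right]
   - \mathbb{E}_{p_\theta(v', h')}\!\left[ v'_i h'_j \right] \\
% PCD replaces the intractable model expectation with an average
% over the current states of the persistent chains.
  &\approx \mathbb{E}_{p_\theta(h \mid v)}\!\left[ v_i h_j \right]
   - \frac{1}{M} \sum_{m=1}^{M} \tilde v^{(m)}_i \, \tilde h^{(m)}_j
\end{align}
% The approximation relies on the update
%   \theta_{t+1} = \theta_t + \eta \, g_t
% being small, so that p_{\theta_{t+1}} \approx p_{\theta_t} and chain
% states that were approximately distributed as p_{\theta_t} remain
% approximately distributed as p_{\theta_{t+1}} after a few Gibbs steps.
```

Read this way, the learning rate $\eta$ bounds how far the target distribution moves per update, which is one way to see why PCD tends to need smaller learning rates than CD-1.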

PCD is widely used in training restricted Boltzmann machines for tasks where CD-1's bias produces poor results, such as learning structured distributions with multiple modes. The algorithm exemplifies a broader principle in approximate inference: when exact computation is intractable, maintain a running approximation and update it incrementally. The same logic appears in parallel tempering, in streaming variational inference, and in particle filter methods.

Persistent contrastive divergence is not merely a technical refinement of CD. It is a demonstration that the equilibrium distribution of a model can be tracked incrementally, by treating the sampling chain as a persistent memory rather than a disposable computation. This is the same principle that underlies incremental inference more generally: the past is not a set of independent observations but a continuous trajectory that constrains the present. PCD makes this principle operational in the context of energy-based models, and the result is an algorithm whose gradient estimates are less biased than those of CD-1 and whose design is more conceptually coherent than its predecessor's.