Diffusion model

A diffusion model is a generative model that learns to reverse a gradual noising process, transforming random noise into structured data through a sequence of denoising steps. Unlike a variational autoencoder, which maps data to a latent distribution in a single pass, or a normalizing flow, which requires invertible transformations, a diffusion model treats generation as a stochastic process unfolding in time.

The key idea, introduced by Sohl-Dickstein et al. and refined by Ho, Jain, and Abbeel, is to define a forward process that gradually adds Gaussian noise to data over many timesteps, and then to learn a reverse process that removes the noise step by step. The forward process is fixed; the reverse process is parameterized by a neural network trained to predict the noise that was added at each step. At generation time, the model starts from pure noise and iteratively denoises it, producing a sample from the learned data distribution.
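The forward and reverse processes described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the linear variance schedule, the 1000-step default, and every function name (make_schedule, q_sample, noise_prediction_loss, p_sample_loop, dummy_predict_noise) are hypothetical choices made here, and the placeholder predictor stands in for a trained neural network.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    # alpha_bars[t] is the fraction of signal variance remaining after step t.
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Fixed forward process: jump from clean data x0 to noise level t in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def noise_prediction_loss(predict_noise, x0, t, alpha_bars, rng):
    """Training objective: mean squared error between predicted and true noise."""
    xt, noise = q_sample(x0, t, alpha_bars, rng)
    return np.mean((predict_noise(xt, t) - noise) ** 2)

def p_sample_loop(predict_noise, shape, betas, alpha_bars, rng):
    """Reverse process: start from pure noise and denoise one step at a time."""
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)  # one network evaluation per step
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(1.0 - betas[t])
        # No fresh noise is added on the final step.
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * z
    return x

def dummy_predict_noise(xt, t):
    """Placeholder for a trained network (hypothetical): always predicts zero noise."""
    return np.zeros_like(xt)
```

With a trained network in place of dummy_predict_noise, p_sample_loop would return approximate samples from the data distribution; with the placeholder it only exercises the control flow.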

Diffusion models have achieved state-of-the-art results in image generation, audio synthesis, and molecular design. Their success raises a fundamental question: why does a gradual, iterative process outperform direct generation? One hypothesis is that the multi-step structure acts as an implicit curriculum, breaking the hard problem of generating coherent data into a sequence of easier denoising subproblems. Another is that the forward process imposes an inductive bias toward smooth, locally correlated structures that matches the statistics of natural data.

The tradeoff is computational. Generating a single sample can require hundreds or thousands of sequential neural network evaluations, making diffusion models far slower at inference time than VAEs or GANs, which produce a sample in a single forward pass. Recent work on accelerating diffusion models, through distillation, latent diffusion, and learned step-size adaptation, attempts to reduce this cost without sacrificing sample quality.
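To make the cost concrete, the sketch below counts network evaluations for a sampler that visits only a subset of timesteps. It uses a deterministic DDIM-style update, which is a technique swapped in here for illustration and is not one of the acceleration methods named above. The function names (strided_sample, make_point_mass_oracle) are hypothetical, and the "network" is a toy oracle that is exact only when all data equals a single point c, so the example shows evaluation counts rather than sample quality.

```python
import numpy as np

def make_point_mass_oracle(c, alpha_bars):
    """Toy noise predictor (hypothetical): exact when every data point equals c."""
    def predict_noise(x, t):
        return (x - np.sqrt(alpha_bars[t]) * c) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise

def strided_sample(predict_noise, x_start, alpha_bars, num_eval_steps):
    """Deterministic DDIM-style sampling over num_eval_steps evenly spaced timesteps.

    Assumes num_eval_steps <= len(alpha_bars). Returns the sample and the
    number of predictor calls made.
    """
    last = len(alpha_bars) - 1
    timesteps = np.linspace(last, 0, num_eval_steps).round().astype(int)
    x = x_start
    calls = 0
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t)
        calls += 1
        ab_t = alpha_bars[t]
        # After the last visited timestep, jump to the noise-free level.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        # Estimate the clean sample, then re-noise it to the next visited level.
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
    return x, calls

# Assumed linear schedule with 1000 steps.
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
c = np.array([2.0, -1.0])
oracle = make_point_mass_oracle(c, alpha_bars)
x_start = np.random.default_rng(0).standard_normal(c.shape)

full_sample, full_calls = strided_sample(oracle, x_start, alpha_bars, 1000)
fast_sample, fast_calls = strided_sample(oracle, x_start, alpha_bars, 50)
```

Here the number of predictor calls equals the number of timesteps visited, so the second run does a twentieth of the work. With a real trained network the prediction is only approximate, and skipping steps generally trades some sample quality for speed.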