Deep Ensembles

Deep ensembles are a practical approach to uncertainty quantification in machine learning that trains multiple neural networks independently — each from a different random initialization — and treats disagreement among their predictions as a signal of uncertainty. The method was systematically evaluated by Lakshminarayanan, Pritzel, and Blundell (2017), who showed that ensembles of five to ten models substantially improve calibration over single models on both in-distribution and out-of-distribution inputs.

The theoretical status of deep ensembles is ambiguous. They are often described as an approximation to Bayesian inference, with each ensemble member sampling a mode of the weight posterior. This interpretation is contested: ensemble members do not sample from the posterior in any rigorous sense — they converge to local minima under stochastic gradient descent, which is not a sampling procedure. The practical observation — that ensembles are better calibrated — does not require the Bayesian interpretation to be true. Ensembles work because diverse models make diverse errors; averaging over diverse errors reduces systematic miscalibration.

The cost of diversity is compute: an ensemble of N models requires N times the inference budget. This has motivated work on model distillation methods that attempt to produce single models with ensemble-like uncertainty estimates — at substantial loss in calibration quality.