Uncertainty Quantification

Uncertainty quantification (UQ) is the discipline of characterizing and communicating the uncertainty of computational predictions — distinguishing what a model knows from what it merely asserts. In machine learning, UQ is the problem of producing calibrated confidence estimates: a system that says it is 90% confident should be correct 90% of the time, across the distribution of inputs it will encounter. This sounds straightforward. It is not.

The distinction between aleatoric uncertainty and epistemic uncertainty is the load-bearing partition in the field. Aleatoric uncertainty is irreducible: it reflects genuine randomness or noise in the data-generating process. If a coin is fair, no additional data eliminates the uncertainty about the next flip. Epistemic uncertainty is reducible: it reflects ignorance that could be corrected with more data or a better model. The practical importance of this distinction is that only epistemic uncertainty can be reduced by additional information. A system that conflates the two will either over-invest in data collection (treating aleatoric noise as reducible) or understate its own ignorance (treating epistemic uncertainty as inherent to the problem).

In machine learning systems, the conflation is systematic. Standard neural network training produces point estimates — single parameter configurations — with no representation of the distribution over possible parameter configurations consistent with the training data. The softmax output of a classifier produces numbers that sum to one and superficially resemble probabilities, but they do not satisfy the frequentist definition of probability (they do not converge to the empirical frequency of correctness as sample size grows, except under specific calibration conditions) and they do not satisfy the Bayesian definition (they do not represent a posterior over hypotheses). They are confidence-shaped numbers. Treating them as uncertainties is an error.

Approaches to Calibration

Several methods attempt to produce genuinely calibrated uncertainty estimates from neural networks:

Bayesian neural networks place a prior over model weights and compute a posterior given data, then integrate predictions over the posterior. This is the theoretically correct approach and the computationally intractable one. The posterior over parameters for a modern neural network is a distribution over billions of dimensions; exact Bayesian inference is impossible, and approximate methods (variational inference, Langevin dynamics, Laplace approximation) each introduce their own biases.

Deep ensembles train multiple models from different random initializations and measure disagreement among their predictions as a proxy for uncertainty. Empirically, ensembles produce better-calibrated uncertainty estimates than single models, particularly on out-of-distribution inputs. The cost is proportional to ensemble size: ten models require ten times the compute. Ensembles also do not capture the true posterior — they sample a handful of modes in the loss landscape rather than integrating over the full distribution.

Temperature scaling adjusts the softmax temperature parameter post-hoc to improve calibration on a held-out validation set. It is cheap and often effective on in-distribution inputs. It does not improve out-of-distribution calibration and can worsen it.

Monte Carlo dropout uses dropout at inference time, sampling multiple predictions per input and measuring their variance. It is an approximation to variational Bayesian inference and shares that method's tendency to underestimate uncertainty in regions far from the training distribution.

None of these methods produces a system that reliably knows what it does not know. Each approach improves calibration in some conditions and fails in others. The failure modes are different, which means that reporting calibration performance on a held-out test set — drawn from the same distribution as training data — does not predict performance on the distributional shifts that matter in deployment.

Calibration and Deployment

The measurement of calibration is itself a calibration problem. Reliability diagrams and Expected Calibration Error (ECE) are computed on a reference dataset. If the reference dataset does not include the types of inputs the deployed system will encounter — which, in open-world deployment, it generally does not — the calibration metrics are optimistic by construction. A model can be perfectly calibrated on a benchmark dataset and wildly miscalibrated on the deployment distribution. This is not an edge case; it is the default condition for any system deployed beyond its training domain.

The practical consequence is that uncertainty quantification, as currently practiced, provides less safety than it appears to. A deployed system with a calibrated UQ module still fails silently when presented with inputs that are far outside the training distribution in ways the calibration procedure did not anticipate. The UQ module expresses high confidence, because it learned to do so on in-distribution data. The system is wrong. This is the expert systems problem reenacted in Bayesian clothing.

The honest statement of the state of the field: uncertainty quantification for machine learning systems is well-defined in the in-distribution regime and unsolved in the open-world regime. The open-world regime is where deployed systems actually operate. Until this gap is closed by principled methods that can characterize out-of-distribution uncertainty without having seen out-of-distribution data, every claimed safety benefit of UQ should be discounted by the probability that the deployment distribution differs from the calibration distribution — which, in practice, is nearly certain.