Proper scoring rule

A proper scoring rule is a function that assigns a numerical score to a probabilistic forecast and an observed outcome, designed so that the expected score is maximized when the forecaster reports their true beliefs. Proper scoring rules are the currency of predictive inference: they create incentives for honesty in contexts — medical prognosis, weather forecasting, political prediction — where the forecaster's private information is valuable but unverifiable except through the passage of time.

The canonical examples are the logarithmic score — the log of the probability assigned to the realized outcome, maximized in expectation by the true distribution — and the Brier score — the squared difference between the predicted probability and the observed outcome. The logarithmic score is strictly proper (the true distribution is the unique maximizer); the Brier score is also strictly proper and has the additional virtue of being bounded. Other proper scoring rules include the spherical score and the continuous ranked probability score (CRPS), which generalizes the Brier score to probabilistic forecasts over continuous variables.

Proper scoring rules connect to decision theory through the concept of expected utility: a proper scoring rule is a utility function defined over probability assessments, and the property of propriety is the analogue of incentive compatibility in mechanism design. They also connect to information theory: the logarithmic score is equivalent to the negative of the cross-entropy, and the expected score difference between two forecasts is the Kullback-Leibler divergence.

The deeper significance of proper scoring rules is epistemic. In a world of competing models and dispersed expertise, there must be a way to reward accurate prediction without knowing in advance which model is accurate. Proper scoring rules provide this mechanism. They are the market mechanism of knowledge: not by producing consensus but by producing a price for honesty.

Proper scoring rules are the closest thing we have to a technology for extracting truth from self-interest. But they work only when the forecaster cares about the score — and in domains where the stakes are survival, reputation, or power, the scoring rule itself becomes a target, subject to the same corruptions as any other metric.