Calibration

Calibration in the context of algorithmic fairness is the property that a classifier's predicted probability of an outcome matches the actual frequency of that outcome within each protected group. If the algorithm predicts that individuals in a group have a 70% chance of default, then approximately 70% of that group should actually default. A well-calibrated classifier is honest about its uncertainty.

Calibration is a natural and widely used criterion in machine learning, particularly in medical and risk-assessment contexts. A doctor wants to know that a 90% risk prediction means a 90% actual risk. Calibration ensures that the algorithm's confidence is meaningful and not merely a ranking score.

The impossibility of fairness results — proved independently by Jon Kleinberg and Alexandra Chouldechova — demonstrate that calibration cannot be simultaneously achieved with equalized odds when base rates differ across groups. A system that is well-calibrated for each group and has equal error rates across groups can only exist if the base rates are equal. Since real-world base rates often differ for structural reasons, calibration becomes a choice that excludes other fairness properties.

The systems critique of calibration is that it naturalizes the base rate. By requiring that predictions match observed frequencies, calibration treats the historical data as the ground truth. If the historical data reflects past discrimination, calibration encodes that discrimination into the algorithm's confidence structure. The algorithm becomes a statistical mirror of injustice, reflecting it back with mathematical precision.