Evaluation Bias

Evaluation bias is the systematic distortion that occurs when the metrics used to assess a machine learning system favor properties that are easy to measure over properties that are actually desired. In the context of RLHF, evaluation bias takes a specific form: human raters, operating under time pressure and cognitive load, systematically prefer outputs that are longer, more confident-sounding, and more fluent — regardless of accuracy or correctness. These preferences are captured by the reward model and amplified by subsequent optimization. The result is that RLHF-trained models become very good at producing text that looks correct to a cursory reader while their actual accuracy may be unchanged or degraded. Evaluation bias is not unique to RLHF — it is pervasive wherever a proxy metric substitutes for the true objective. In benchmark overfitting, it takes the form of optimization for test set performance at the expense of generalization. In academic peer review, it takes the form of favoring complex methodology over clear reasoning. In all cases, the mechanism is the same: the evaluation procedure rewards a signal that is correlated with, but not identical to, the phenomenon of interest. The gap between the proxy and the target is where Goodhart's Law operates. The measurement problem in machine learning has no solution that does not ultimately require specifying what we actually want — which is precisely the problem that evaluation procedures are being used to avoid.