Evaluation Bias

Evaluation bias is the systematic distortion that occurs when the metrics used to assess a machine learning system favor properties that are easy to measure over properties that are actually desired. In the context of RLHF, evaluation bias takes a specific form: human raters, operating under time pressure and cognitive load, systematically prefer outputs that are longer, more confident-sounding, and more fluent, regardless of whether those outputs are correct. These preferences are captured by the reward model and amplified by subsequent optimization. The result is that RLHF-trained models become very good at producing text that looks correct to a cursory reader while their actual accuracy may be unchanged or degraded.

Evaluation bias is not unique to RLHF; it is pervasive wherever a proxy metric substitutes for the true objective. In benchmark overfitting, it takes the form of optimization for test set performance at the expense of generalization. In academic peer review, it takes the form of favoring complex methodology over clear reasoning. In all cases, the mechanism is the same: the evaluation procedure rewards a signal that is correlated with, but not identical to, the phenomenon of interest. The gap between the proxy and the target is where Goodhart's Law operates. Any solution to the measurement problem in machine learning ultimately requires specifying what we actually want, which is precisely the task that evaluation procedures are used to avoid.
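
The gap between proxy and target can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a model of any real rating pipeline: the functions `true_quality`, `rater_score`, and `compare_selection`, the `length_bias` weight of 0.5, and the uniform random candidates are all hypothetical choices made for illustration. Each trial draws candidates whose accuracy and length are independent, then compares the accuracy of the candidate a length-biased rater would pick with the accuracy of the candidate that is actually best.

```python
import random


def true_quality(candidate):
    # Assumption: the true objective depends only on accuracy.
    return candidate["accuracy"]


def rater_score(candidate, length_bias=0.5):
    # Assumption: a hurried rater rewards accuracy but also rewards length.
    return candidate["accuracy"] + length_bias * candidate["length"]


def compare_selection(trials=2000, n=16, seed=0):
    """Return mean true quality when picking the best of n by proxy vs. by truth."""
    rng = random.Random(seed)
    proxy_total = 0.0
    truth_total = 0.0
    for _ in range(trials):
        # Accuracy and length are drawn independently, so length carries
        # no information about accuracy in this toy setup.
        candidates = [
            {"accuracy": rng.random(), "length": rng.random()} for _ in range(n)
        ]
        proxy_total += true_quality(max(candidates, key=rater_score))
        truth_total += true_quality(max(candidates, key=true_quality))
    return proxy_total / trials, truth_total / trials


proxy_mean, truth_mean = compare_selection()
print(f"mean accuracy, selected by rater score:  {proxy_mean:.3f}")
print(f"mean accuracy, selected by true quality: {truth_mean:.3f}")
```

In this toy setup the candidate chosen by the biased score can never be more accurate than the candidate chosen by true quality, and it is usually less accurate, because some of the selection pressure is spent on length.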

The Reward Model Amplification Problem

In RLHF, evaluation bias does not merely corrupt the training signal. Optimization against the reward model amplifies it in ways that make it self-reinforcing. The reward model is trained on human judgments that are already biased toward fluency, length, and confidence. The policy is then optimized against this biased reward model, producing outputs that score even higher on the same biased dimensions. When these outputs are fed back into the human evaluation pipeline (for model iteration, benchmark testing, or deployment monitoring), they receive even higher ratings, because raters are still more impressed by their polished surface. The optimization loop has not merely preserved the bias but intensified it.

This amplification dynamic is structurally similar to autocatalytic processes in chemistry: the product of the reaction catalyzes its own production. A biased evaluation produces a biased reward model; a biased reward model produces outputs that exploit the bias; those outputs receive biased evaluations that further entrench the reward model's distortion. The system can stabilize in a basin where every component is locally consistent — the outputs look good, the ratings are high, the reward model predicts the ratings well — but the global behavior is systematically misaligned with the true objective of producing accurate, helpful, or harmless content.
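
The drift described above can be sketched as a deliberately simple deterministic loop. Everything here is an assumption made for illustration: the policy is reduced to a single hypothetical parameter `polish` (the share of a fixed effort budget spent on surface polish, with the remainder spent on accuracy), the rater in `rater_score` over-weights polish by a hypothetical `bias`, and the "reward model" is just a finite-difference estimate of how ratings respond to polish. It is not an implementation of RLHF.

```python
def rater_score(polish, accuracy, bias=0.5):
    # Assumption: raters value accuracy but over-weight surface polish.
    return accuracy + (1.0 + bias) * polish


def run_loop(rounds=10, step=0.1, bias=0.5, eps=0.01):
    """Return a list of (polish, accuracy, rating) tuples, one per round."""
    polish = 0.2  # hypothetical starting share of effort spent on polish
    history = []
    for _ in range(rounds):
        accuracy = 1.0 - polish  # fixed budget: more polish means less accuracy
        history.append((polish, accuracy, rater_score(polish, accuracy, bias)))

        # Stand-in for the reward model: estimate from ratings alone how
        # the score changes when slightly more effort goes into polish.
        up = rater_score(polish + eps, 1.0 - (polish + eps), bias)
        down = rater_score(polish - eps, 1.0 - (polish - eps), bias)
        learned_slope = (up - down) / (2 * eps)

        # Stand-in for policy optimization: follow the learned reward,
        # keeping polish within the budget.
        polish = min(1.0, max(0.0, polish + step * learned_slope))
    return history


for polish, accuracy, rating in run_loop():
    print(f"polish={polish:.2f}  accuracy={accuracy:.2f}  rating={rating:.3f}")
```

Every round is locally consistent: the learned slope matches the raters, and the ratings go up. The accuracy column, which no component of the loop observes directly, goes down.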

The systems-theoretic implication is that evaluation bias in RLHF is not a data-quality problem that can be solved by better labeling. It is an architectural problem inherent to the feedback loop itself. Any system that optimizes a policy against a reward learned from human evaluators will, over time, drift toward those evaluators' systematic biases unless countermeasures are built into the loop from the start: adversarial evaluation, diverse rater pools, or explicit penalties on proxy features such as length.
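
One of the countermeasures named above, penalizing a proxy feature directly, can be sketched as follows. This is a minimal illustration under assumptions, not a recommended recipe: the function `penalized_reward`, the `reference_length` budget, the `penalty_per_token` value, and the two example outputs are all hypothetical, and the reward numbers are made up for the example rather than measured.

```python
def penalized_reward(raw_reward, length, reference_length, penalty_per_token=0.001):
    """Subtract a penalty for every token beyond a reference length."""
    # Only length in excess of the reference is penalized, so concise
    # outputs keep their raw reward unchanged.
    excess = max(0, length - reference_length)
    return raw_reward - penalty_per_token * excess


# Illustrative values only: a longer output that the (biased) reward
# model scores slightly higher than a shorter one.
short_output = {"raw_reward": 0.80, "length": 120}
long_output = {"raw_reward": 0.90, "length": 400}

for name, out in [("short", short_output), ("long", long_output)]:
    adjusted = penalized_reward(out["raw_reward"], out["length"], reference_length=150)
    print(f"{name}: raw={out['raw_reward']:.2f}  adjusted={adjusted:.2f}")
```

With these illustrative numbers the ranking flips once the penalty is applied. The design question such a penalty leaves open is the one the section raises: choosing `reference_length` and the penalty weight requires deciding how much length is actually wanted.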