Elo Rating System

The Elo rating system is a method for estimating the relative skill of players in zero-sum games such as chess. Developed by Arpad Elo in the 1960s, it assigns each player a numerical rating that is updated after each game according to the difference between the actual score and the expected score. The expected score is, in the form commonly used today, a logistic function of the difference between the two players' ratings, and the size of each update is controlled by a K-factor that determines how quickly ratings respond to new results. The system is elegant because it requires no global optimization: each game updates only the two players involved, yet the ratings tend toward a consistent global ordering under reasonable conditions, for example when skills are stable and players are sufficiently connected to one another through games.
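The update described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the scale of 400 and the K-factor of 32 are conventional values assumed here rather than taken from the text.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B (logistic curve).

    scale=400 is the conventional chess value, assumed here for illustration.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after a single game.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k=32 is an assumed K-factor; rating bodies use different values.
    """
    expected_a = expected_score(rating_a, rating_b)
    # A gains exactly what B loses, so the sum of the two ratings is unchanged.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated players: the winner gains K/2 points.
    print(update_ratings(1500.0, 1500.0, score_a=1.0))
```

Only the two ratings involved are touched, which is the "no global optimization" property: no other player's rating needs to be read or recomputed. A larger K makes ratings react faster to new results at the cost of more noise.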

The Elo framework has been generalized far beyond chess. In machine learning, it is closely related to the Bradley-Terry model, which is used to train reward models from pairwise human preferences. The mathematical structure is the same in both settings: probabilistic comparison, transitive inference, and iterative update. The difference is that chess players have objective game results, while preference models are fit to noisy human judgments. The Elo system also assumes stationarity: a player's true skill does not change during the rating period. This assumption can fail in preference modeling, where the "players" are model outputs whose quality evolves as the base model is updated. The Elo framework is thus a useful approximation, not a perfect fit, and the gap between the approximation and the reality is where evaluation bias enters.
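The shared structure can be checked numerically. The sketch below assumes the standard forms of each model: the logistic Elo curve with a scale of 400, the Bradley-Terry probability s_a / (s_a + s_b), and a sigmoid of the reward difference as the preference probability for reward models. All function names are hypothetical and chosen for illustration only.

```python
import math


def elo_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Elo expected score of A against B (scale=400 is an assumed convention)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def bradley_terry_probability(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B (strengths > 0)."""
    return strength_a / (strength_a + strength_b)


def sigmoid_preference(reward_a: float, reward_b: float) -> float:
    """Preference probability as a sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def rating_to_strength(rating: float, scale: float = 400.0) -> float:
    """Map an Elo rating to a Bradley-Terry strength: s = 10^(R / scale)."""
    return 10.0 ** (rating / scale)


def rating_to_reward(rating: float, scale: float = 400.0) -> float:
    """Map an Elo rating to a log-strength ("reward"): r = R * ln(10) / scale."""
    return rating * math.log(10.0) / scale


if __name__ == "__main__":
    r_a, r_b = 1600.0, 1500.0
    print(elo_probability(r_a, r_b))
    print(bradley_terry_probability(rating_to_strength(r_a), rating_to_strength(r_b)))
    print(sigmoid_preference(rating_to_reward(r_a), rating_to_reward(r_b)))
```

Up to floating-point error the three printed values agree, so under these assumptions an Elo rating is a rescaled log-strength. What the sketch does not capture is the stationarity problem: each function treats a rating as a fixed quantity, which is the assumption that breaks when the compared outputs come from a model that keeps changing.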