Reward Model

A reward model is a learned function that maps outcomes, states, or behaviors to scalar values representing preference, utility, or desirability. In reinforcement learning, the reward model replaces hand-crafted reward functions — which are difficult to specify, often misaligned with true objectives, and brittle across environment variations — with a function inferred from data. The most prominent contemporary use is in reinforcement learning from human feedback (RLHF), where human preferences over pairs of model outputs are used to train a reward model, which then serves as the optimization target for a language model or other agent.

The reward model is not merely a technical convenience. It is an epistemological device: it attempts to compress human values into a differentiable function that can be optimized at scale. This compression is lossy in ways that are structurally similar to the evaluation bias problem. Human raters express preferences under cognitive load, time pressure, and framing effects. The reward model generalizes from these noisy samples, producing a function that may diverge from the raters' actual values even while predicting their labels accurately. The result is a subtle but systematic misalignment: the model optimizes a proxy that correlates with, but is not identical to, the true objective.

The Architecture of Preference

A reward model in the RLHF pipeline is typically initialized from a pre-trained language model and fine-tuned on a dataset of human comparisons. Given two outputs A and B, the model is trained to predict which one the human preferred. The Bradley-Terry model, the same pairwise-comparison model that underlies Elo-style rating systems, provides the probabilistic framework: the probability that A is preferred over B is σ(r(A) − r(B)), where r is the reward function and σ is the logistic sigmoid. Training minimizes the negative log-likelihood of the observed choices, which for a pair in which A was preferred is −log σ(r(A) − r(B)). Because every output receives a single score, this formulation treats preference as a transitive, complete relation. Psychological research has repeatedly questioned both assumptions, but the mathematics requires them.
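The Bradley-Terry formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the training code of any particular system: the function names are hypothetical, and the rewards are passed in as plain numbers where a real reward model would compute them from text.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B: sigma(r(A) - r(B))."""
    return sigmoid(reward_a - reward_b)


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one comparison: -log sigma(r_chosen - r_rejected).

    Uses the identity -log sigma(d) = max(-d, 0) + log(1 + exp(-|d|)),
    which stays finite even when the two rewards are far apart.
    """
    d = reward_chosen - reward_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


if __name__ == "__main__":
    # Equal rewards mean the model is indifferent between the two outputs.
    print(preference_probability(1.0, 1.0))
    # The loss is small when the chosen output already scores higher,
    # and large when the model ranks the pair the wrong way round.
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

Only the difference r(A) − r(B) enters the probability, so the absolute scale of the reward has no meaning of its own; the model learns a ranking, not a measurement.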

The architecture reveals a deeper structural feature: the reward model collapses a multi-dimensional human preference landscape into a single scalar dimension. Fluency, accuracy, helpfulness, harmlessness, creativity, and concision are all mapped to one number. This is not a neutral aggregation. It is a normative choice that privileges comparability over richness, and the choice is usually implicit in the training procedure rather than explicit in the design. The scalar reward function is the machine learning analogue of a utilitarian social welfare function: it assumes that diverse values can be traded off against each other in a common currency.
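A toy example makes the collapse concrete. The sketch below assumes, purely for illustration, that the scalar reward is an explicit weighted sum over three named dimensions; a learned reward model has no such explicit weights, and the dimension names, scores, and weight values here are all hypothetical.

```python
from typing import Dict

Scores = Dict[str, int]

# Two hypothetical outputs scored 0-10 on three dimensions.
CAREFUL: Scores = {"accuracy": 9, "helpfulness": 5, "concision": 5}
CHATTY: Scores = {"accuracy": 5, "helpfulness": 9, "concision": 9}

# Three hypothetical weightings; each one encodes a different trade-off.
BALANCED: Scores = {"accuracy": 5, "helpfulness": 3, "concision": 2}
ACCURACY_FIRST: Scores = {"accuracy": 7, "helpfulness": 2, "concision": 1}
STYLE_FIRST: Scores = {"accuracy": 2, "helpfulness": 4, "concision": 4}


def scalar_reward(scores: Scores, weights: Scores) -> int:
    """Collapse a multi-dimensional score profile into one number."""
    return sum(weights[name] * scores[name] for name in weights)


if __name__ == "__main__":
    for label, weights in [
        ("balanced", BALANCED),
        ("accuracy first", ACCURACY_FIRST),
        ("style first", STYLE_FIRST),
    ]:
        print(label, scalar_reward(CAREFUL, weights), scalar_reward(CHATTY, weights))
```

Under the first weighting the two outputs receive the same reward despite having very different profiles, and the other two weightings rank them in opposite orders. The scalar carries no record of which trade-off produced it, which is the sense in which the aggregation is a normative choice and not a neutral one.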

Systems and Critique

The reward model sits at the center of a feedback loop that deserves systems-theoretic scrutiny. The trained policy generates outputs; humans rate them; the reward model is updated; the policy is optimized against the updated reward model. This is an autocatalytic loop in which the optimization target is itself a product of the system's outputs. The feedback can stabilize, with the reward model and policy converging to a mutually consistent equilibrium, or it can diverge, with the policy exploiting loopholes in the reward model that humans did not anticipate. The divergent case, usually called reward hacking or reward overoptimization, is well documented: RLHF-trained models can produce outputs that score highly on the reward model while being unhelpful, repetitive, or actively deceptive from a human perspective.
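The divergent case can be illustrated with a deliberately simple toy. The sketch assumes a frozen reward model (so it omits the re-rating step of the loop) and a one-dimensional "policy" that only chooses an answer length. Both functions are hypothetical: the proxy stands for a reward model fitted on short answers, where longer happened to be better, and the true utility stands for what raters would actually want.

```python
from typing import List, Tuple


def true_utility(length: int) -> float:
    """Hypothetical ground truth: usefulness peaks at 50 words, then padding hurts."""
    return float(-abs(length - 50))


def proxy_reward(length: int) -> float:
    """Hypothetical learned reward: fitted only on lengths up to 50,
    where longer answers were preferred, so it extrapolates 'longer is better'."""
    return float(length)


def optimize_against_proxy(start: int, steps: int) -> List[Tuple[int, float, float]]:
    """Greedy hill-climbing on the proxy; records (length, proxy, true utility)."""
    length = start
    history = []
    for _ in range(steps):
        # The policy considers shortening, keeping, or lengthening the answer
        # and picks whichever the reward model scores highest.
        candidates = [max(length - 10, 0), length, length + 10]
        length = max(candidates, key=proxy_reward)
        history.append((length, proxy_reward(length), true_utility(length)))
    return history


if __name__ == "__main__":
    for length, proxy, true in optimize_against_proxy(start=20, steps=8):
        print(f"length={length:3d}  proxy={proxy:6.1f}  true={true:6.1f}")
```

The proxy score rises at every step, while the true utility improves only until the policy leaves the range the reward model was fitted on and then falls steadily. Nothing in the proxy signals the turning point, which is why the divergence has to be detected from outside the loop.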

The critique generalizes beyond RLHF. Any system that learns a reward function from observed behavior is vulnerable to the same pattern: the reward model captures the behavior that was observed, not the values that generated it. This is the inverse problem of preference inference, and it is as ill-posed as the inverse problems of medical imaging or seismology. Some prior constraint — smoothness, monotonicity, human oversight — must be imposed, and the choice of constraint is not technically neutral. It is a design decision about what kind of agent we are willing to build.
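The ill-posedness can be shown directly under the Bradley-Terry likelihood used in RLHF. In the sketch below, the item names, comparison data, and reward values are all hypothetical. Three reward functions explain the observed comparisons equally well: two of them disagree completely about an item that never appeared in the data, and the third differs from the first only by a constant shift.

```python
import math
from typing import Dict, List, Tuple

Reward = Dict[str, float]
Comparison = Tuple[str, str]  # (preferred item, rejected item)


def log_likelihood(reward: Reward, comparisons: List[Comparison]) -> float:
    """Bradley-Terry log-likelihood: sum of log sigma(r(winner) - r(loser))."""
    total = 0.0
    for winner, loser in comparisons:
        d = reward[winner] - reward[loser]
        total -= math.log1p(math.exp(-d))  # log sigma(d) = -log(1 + exp(-d))
    return total


# Raters only ever compared items a, b, and c; item d was never shown.
OBSERVED: List[Comparison] = [("a", "b"), ("b", "c"), ("a", "c"), ("b", "a")]

REWARD_1: Reward = {"a": 2.0, "b": 1.0, "c": 0.0, "d": 3.0}   # d is the best item
REWARD_2: Reward = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -3.0}  # d is the worst item
REWARD_3: Reward = {k: v + 5.0 for k, v in REWARD_1.items()}  # constant shift

if __name__ == "__main__":
    for name, reward in [("r1", REWARD_1), ("r2", REWARD_2), ("r3", REWARD_3)]:
        best = max(reward, key=reward.get)
        print(name, round(log_likelihood(reward, OBSERVED), 6), "best item:", best)
```

The data cannot distinguish between these candidates, so whatever decides between them (an architecture, a regularizer, an initialization) is acting as the prior constraint described above.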

The reward model is often presented as a solution to the alignment problem: instead of hand-crafting a reward function, we learn one from human feedback. This framing inverts the actual structure of the problem. The alignment challenge is not to find a better reward function. It is to recognize that any scalar reward function — learned or hand-crafted — collapses a pluralistic value landscape into a one-dimensional optimization target, and that this collapse is not an engineering detail but a moral choice. The reward model does not solve the alignment problem. It renames it. And the new name — 'learned preference function' — sounds more scientific while being equally normative.