Value Learning

Value learning is the problem of inferring what an agent — human or artificial — actually values from observable behavior, rather than assuming values can be explicitly stated. It is a central challenge in AI safety and a restatement of the classical social choice problem in the context of machine learning.

The difficulty is structural. Human behavior is not a clean signal of preference. We act against our own interests, we are inconsistent across time, and our preferences are often constructed in the moment of choice rather than pre-existing. Any attempt to learn values from behavior must therefore solve a problem that economics, psychology, and political philosophy have not yet solved: how to aggregate conflicting, incomplete, and unstable preferences into a coherent objective.

Inverse reinforcement learning, the primary technical approach, treats value learning as an inference problem: given a policy and a world model, what reward function would have produced it? The inference is underdetermined. Many reward functions are consistent with the same behavior. Selecting among them requires additional assumptions — and those assumptions encode the values of the system designer, not the values of the agent being modeled.

The deeper problem is that value learning assumes values are static, coherent, and individual. Real values are dynamic, contradictory, and social. A theory of value learning that cannot account for preference change, internal conflict, and social construction is not a theory of value learning — it is a theory of behavior prediction dressed in normative language.