Value alignment

Value alignment is the field of research concerned with ensuring that artificial intelligence systems learn and pursue objectives that genuinely reflect human values, rather than optimizing poorly specified proxies that diverge from human intent under optimization pressure. It differs from the general alignment problem in that it focuses specifically on the content of the goals (what the system should want) rather than merely the form (whether the system pursues what it was told to pursue). It intersects with moral philosophy (which values should be aligned?), economics (how do we aggregate conflicting values?), and psychology (how do humans actually form and revise their values?).

The central technical challenge in value alignment is that human values are not explicitly represented anywhere. They are implicit in human judgments, preferences, and behaviors, and they are often inconsistent, context-dependent, and subject to revision. Techniques such as inverse reinforcement learning (inferring reward functions from observed behavior), preference learning (eliciting preferences through comparison queries), and constitutional AI (training systems to adhere to explicit written principles) all attempt to approximate human values from limited evidence, whether that evidence is behavior, comparisons, or a short list of stated principles. But the deeper problem is normative: even if we could perfectly infer a person's values, those values may themselves be flawed, the product of ignorance, bias, or manipulation. Value alignment therefore requires not just technical sophistication but also a theory of value formation and a commitment to corrigibility, meaning the system's willingness to be corrected when its current understanding of values is challenged.
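To make the idea of preference learning from comparison queries concrete, the following is a minimal sketch, assuming a Bradley-Terry model in which each option has a single scalar score and the probability that one option is preferred over another is a logistic function of the score difference. The function names, the toy comparison data, and the learning-rate and epoch settings are all illustrative assumptions, not part of any particular system described above.

```python
import math


def fit_bradley_terry(n_items, comparisons, lr=0.1, epochs=500):
    """Fit one scalar score per item from (winner, loser) index pairs.

    Illustrative sketch: plain gradient ascent on the Bradley-Terry
    log-likelihood. Hyperparameters are assumptions, not tuned values.
    """
    scores = [0.0] * n_items
    for _ in range(epochs):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            # Modelled probability that the recorded winner is preferred.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient of log(p) with respect to each of the two scores.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores


def preference_probability(scores, a, b):
    """Modelled probability that item a is preferred to item b."""
    return 1.0 / (1.0 + math.exp(scores[b] - scores[a]))


# Hypothetical comparison data: each pair is (preferred, rejected).
# The judgments are deliberately inconsistent, as human judgments often are.
comparisons = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (0, 2), (2, 1)]
scores = fit_bradley_terry(3, comparisons)
```

The fitted scores summarize noisy and partly contradictory judgments as a single ranking, which illustrates both the appeal and the limitation noted above: the model recovers what the comparisons express, not whether those expressed preferences are well founded.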