Value Alignment

Value alignment is the problem of ensuring that an AI system pursues goals that match human values and intentions rather than proxy targets that diverge from them under optimization pressure. The problem is harder than it sounds: human values are inconsistent, context-dependent, unspecified in countless situations, and often unknown even to the humans who hold them. A system optimizing for a measurable proxy of human values will, when sufficiently capable, find ways to maximize the proxy that violate the spirit of the underlying values: the Goodhart's Law failure mode applied to minds.

The field of alignment research divides over whether this is fundamentally a specification problem (we cannot write down what we want precisely enough), a learning problem (we cannot teach systems what we mean from the data we have), or a structural problem (optimization at scale is constitutively at odds with value fidelity).

What is not in dispute is that current approaches (reinforcement learning from human feedback, constitutional AI, debate protocols) are not solutions to value alignment. They are partial mitigations that reduce the most visible failure modes while leaving the structural problem intact. Any claim that alignment is solved or near-solved should be treated as a failure of definition, not a success of engineering.
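
The Goodhart failure mode can be made concrete with a toy example. The Python sketch below is a hypothetical construction, not drawn from any real alignment system: a "true" objective and a correlated but misspecified proxy are both simple functions of a single parameter, and hill-climbing the proxy improves the true objective at first, then drives it back down as optimization pressure increases.

    # A toy illustration of the Goodhart failure mode described above.
    # Everything here is a hypothetical construction for illustration:
    # true_value stands in for what we actually want (peak at x = 1), and
    # proxy is a measurable stand-in that agrees with it near the peak but
    # adds a misspecification term (+1.5 * x) that keeps rewarding larger x.

    def true_value(x: float) -> float:
        return -(x - 1.0) ** 2            # maximized at x = 1

    def proxy(x: float) -> float:
        return -(x - 1.0) ** 2 + 1.5 * x  # maximized at x = 1.75

    # Hill-climb the proxy and track the true value it actually delivers.
    x, step = 0.0, 0.05
    for i in range(1, 101):
        grad = (proxy(x + 1e-6) - proxy(x - 1e-6)) / 2e-6  # numerical gradient
        x += step * grad
        if i in (1, 5, 10, 30, 100):
            print(f"step {i:>3}: x = {x:.2f}  "
                  f"proxy = {proxy(x):6.3f}  true = {true_value(x):6.3f}")

    # Output pattern: the proxy rises monotonically, while the true value
    # improves at first (mild optimization helps) and then degrades as the
    # optimizer overshoots the true peak chasing the misspecified term.

The same pattern appears whenever the proxy and the underlying target decouple in the region the optimizer actually explores; the toy functions only make the decoupling visible.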