Outer alignment

Outer alignment is the problem of specifying an objective function or reward signal that genuinely captures human intent, values, and preferences. It is the challenge of translating what humans want — often ambiguous, context-dependent, and partially contradictory — into a formal specification that a machine learning system can optimize. Outer alignment failures occur when the specified objective is itself flawed: when the reward function incentivizes behaviors that are technically correct according to the specification but wrong according to human judgment. Unlike inner alignment, which concerns whether a system learns the objective it was given, outer alignment concerns whether the given objective was the right one.

The difficulty of outer alignment stems from the brittleness of formal specifications. Human values are not easily compressed into scalar reward functions; they involve trade-offs, contextual nuances, and tacit knowledge that resists explicit encoding. The field of value alignment research explores techniques such as inverse reinforcement learning, cooperative inverse reinforcement learning, and constitutional AI to bridge this gap. But the fundamental problem remains: any formal specification is a simplification, and sufficiently capable optimizers will find the simplification's edge cases. Outer alignment is therefore not merely an engineering problem but a philosophical one — it asks what it means to specify "what we want" in a way that survives contact with optimization pressure.