AI alignment

AI alignment is the problem of ensuring that artificial intelligence systems pursue the objectives their designers intend, rather than optimizing proxy measures in ways that produce harmful or unintended consequences. The problem is not merely technical; it is also an epistemic problem with a network structure: how a system's model of the world, its model of human values, and its action-selection mechanism can be kept coherent with one another as capability increases. When an AI's world model becomes more accurate than its designers' (a condition already being approached in some narrow domains), the alignment problem becomes one of authority lock-in: the AI's epistemic network has outgrown the human validation network that was supposed to correct it, so its overseers can no longer reliably tell when it is wrong.
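The proxy failure mode can be made concrete with a deliberately simple toy. The sketch below is a hypothetical illustration, not a model of any real system: it assumes an intended objective `true_value` and a proxy `proxy_score` that move together for small inputs but diverge beyond x = 5, and it applies the same greedy hill-climbing (`hill_climb`, also hypothetical) to each.

```python
from typing import Callable


def true_value(x: float) -> float:
    """Hypothetical intended objective: improves up to x = 5, then degrades."""
    return x - 0.1 * x ** 2


def proxy_score(x: float) -> float:
    """Hypothetical proxy: rises with true_value for x < 5 but keeps rising after."""
    return x


def hill_climb(score: Callable[[float], float], x0: float = 0.0,
               step: float = 0.5, iters: int = 40) -> list[float]:
    """Greedy local search: move to a neighbour only if it raises `score`."""
    x = x0
    path = [x]
    for _ in range(iters):
        best = max((x - step, x + step), key=score)
        if score(best) <= score(x):
            break  # no neighbour improves the score: local optimum
        x = best
        path.append(x)
    return path


if __name__ == "__main__":
    for name, score in (("proxy", proxy_score), ("true objective", true_value)):
        x_final = hill_climb(score)[-1]
        print(f"optimizing {name}: x = {x_final:.1f}, "
              f"true value = {true_value(x_final):.2f}")
```

With the default settings, optimizing the proxy runs to x = 20, where the true value is negative, while optimizing the intended objective stops at its peak, x = 5. The functions are chosen only to show that a score correlated with the intended objective over part of its range can become anticorrelated under sustained optimization pressure.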

The field is sometimes divided into outer alignment (specifying the right objective) and inner alignment (ensuring the model actually pursues that objective), but this distinction may be misleading. In practice, objectives are not specified independently of models; they are learned from human feedback, which is itself a noisy and biased signal. The real problem, then, is not aligning a system to a fixed objective but maintaining alignment as the system's epistemic topology (the way its models of the world and of human values are structured and connected) reconfigures through training and deployment. This is the alignment analog of plasticity in biological systems: the capacity to adapt without losing functional coherence.
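The point that a learned objective inherits the biases of its feedback can be illustrated with a small reward model. This sketch uses a Bradley-Terry pairwise-preference model, a common way to learn a reward from comparisons but not a method the text above prescribes. The items, the two features (quality and length), and the labeler's assumed bias toward length are all hypothetical; soft labels stand in for a labeler whose individual choices are noisy.

```python
import math
from itertools import combinations

# Hypothetical items described by two features: (quality, length).
ITEMS = [(0.2, 1.0), (0.9, 0.3), (0.5, 0.8), (0.1, 0.2), (0.7, 0.9), (0.4, 0.1)]

INTENDED_W = (1.0, 0.0)  # hypothetical designer intent: reward quality only
LABELER_W = (1.0, 0.5)   # hypothetical labeler bias: also rewards length


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def utility(w, x) -> float:
    """Linear reward: weighted sum of an item's features."""
    return w[0] * x[0] + w[1] * x[1]


def preference_data(w_labeler):
    """Soft pairwise labels: P(a preferred to b) for a noisy Bradley-Terry labeler."""
    return [(a, b, sigmoid(utility(w_labeler, a) - utility(w_labeler, b)))
            for a, b in combinations(ITEMS, 2)]


def fit_reward(data, lr=0.5, epochs=5000):
    """Fit linear reward weights by gradient descent on Bradley-Terry cross-entropy."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for a, b, p in data:
            diff = (a[0] - b[0], a[1] - b[1])
            pred = sigmoid(w[0] * diff[0] + w[1] * diff[1])
            for k in range(2):
                grad[k] += (pred - p) * diff[k]
        for k in range(2):
            w[k] -= lr * grad[k] / len(data)
    return w


if __name__ == "__main__":
    w = fit_reward(preference_data(LABELER_W))
    print("learned weights (quality, length):", [round(v, 3) for v in w])
    print("top item, intended reward:", max(ITEMS, key=lambda x: utility(INTENDED_W, x)))
    print("top item, learned reward: ", max(ITEMS, key=lambda x: utility(w, x)))
```

The fitted weights should come out close to the labeler's (1.0, 0.5) rather than the designers' (1.0, 0.0), so the learned reward ranks the long item (0.7, 0.9) above the highest-quality item (0.9, 0.3). Nothing in the fitting procedure is faulty; the reward model faithfully learns the signal it was given, including its bias.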