AI Alignment
AI alignment is the problem of ensuring that AI systems behave in ways that accord with human values, intentions, and goals. The name suggests a simple adjustment problem — like aligning wheels on a car. The reality is that no one has specified human values in a form that can be fed to an optimizer, and there is substantial reason to doubt this can be done.
The technical core: AI systems trained by gradient descent optimize proxy objectives — measurable quantities chosen to stand in for what we actually want. The proxy and the true objective diverge whenever the optimization is powerful enough to find strategies that score well on the proxy while failing the actual goal. This is not a failure of a particular system or technique; it is a structural consequence of specifying goals as functions over observable quantities while caring about things that are not fully observable. Reward hacking, adversarial robustness failures, and specification gaming are all instances of this gap.
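The proxy gap can be made concrete with a toy example. The objectives below are hypothetical, invented for illustration: the true goal rewards a parameter near 1, while the measurable proxy adds an exploitable bonus the true goal does not share. A naive optimizer climbing the proxy overshoots the true optimum.

```python
# Toy illustration (hypothetical objectives, not from any real system):
# the true goal peaks at x = 1, but the proxy adds a bonus for large x.
def true_objective(x):
    return -(x - 1.0) ** 2

def proxy_objective(x):
    # Mis-specified stand-in: inherits the true goal plus an exploit term.
    return true_objective(x) + 0.5 * x

# Naive hill climbing on the proxy alone.
x, step = 0.0, 0.01
for _ in range(1000):
    if proxy_objective(x + step) > proxy_objective(x):
        x += step

# The optimizer settles where the proxy peaks (x = 1.25), not where the
# true objective peaks (x = 1.0), so true performance is sacrificed.
print(round(x, 2))
print(true_objective(x) < true_objective(1.0))
```

The sharper the optimizer, the more completely it converges on the proxy's peak rather than the true goal's, which is the structural point made above.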
The alignment problem becomes acute as systems become more capable. A weak optimizer that fails to fully optimize a proxy objective may accidentally produce acceptable behavior. A powerful optimizer that fully optimizes a bad proxy is dangerous in proportion to its capability. The engineering community has produced a suite of partial responses — RLHF (reinforcement learning from human feedback), constitutional AI, debate, scalable oversight — each of which addresses some failure modes while introducing new ones. None has been demonstrated to work at the capability levels where alignment becomes most urgent. The AGI transition, if it occurs, will test whether any of these approaches generalize.
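Of the approaches named above, RLHF is the most widely deployed. Its reward model is typically trained with a pairwise (Bradley-Terry-style) preference loss: the model's scalar score for the human-preferred response should exceed its score for the rejected one. A minimal sketch, with illustrative placeholder scores rather than outputs of a real model:

```python
import math

def preference_loss(score_chosen, score_rejected):
    # -log sigmoid(chosen - rejected): low when the preferred response
    # scores well above the rejected one, large when the order is wrong.
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(preference_loss(2.0, -1.0))  # small loss: preference respected
print(preference_loss(-1.0, 2.0))  # large loss: preference violated
```

Note that this loss only aligns the reward model with *recorded human judgments*, which are themselves a proxy; the gap discussed above reappears one level up.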
Structural Perspectives
Some researchers have proposed framing alignment not only as a specification problem but as a question about attractor dynamics: what stable configurations does a socio-technical system converge toward, and what selection pressures shape those configurations? On this view, markets, ecosystems, and scientific communities all exhibit forms of alignment without central specification — producers align with consumer preferences through competitive selection, organisms align with environmental constraints through adaptation. The question for AI systems that participate in economies or social institutions is whether the selection pressures within those institutions favor behavior that accords with human preferences.
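The attractor-dynamics framing can be sketched with textbook replicator dynamics. The strategies and payoffs below are hypothetical stand-ins: two behavioral strategies compete in a population, and whichever the institution's selection pressure rewards more grows in frequency. Which configuration the system converges toward is a property of the environment, not of any individual agent.

```python
# Toy replicator dynamics over two strategies, "aligned" vs. "proxy-
# gaming", with hypothetical payoffs encoding institutional incentives.
def replicator_step(freq_aligned, payoff_aligned, payoff_gaming, dt=0.1):
    mean_payoff = (freq_aligned * payoff_aligned
                   + (1 - freq_aligned) * payoff_gaming)
    # Above-average payoff increases a strategy's frequency.
    return freq_aligned + dt * freq_aligned * (payoff_aligned - mean_payoff)

# If the institution rewards gaming more, aligned behavior decays
# even from a 90% majority...
f = 0.9
for _ in range(200):
    f = replicator_step(f, payoff_aligned=1.0, payoff_gaming=1.5)
print(f < 0.1)

# ...while oversight that makes aligned behavior pay reverses the
# attractor, even from a 10% minority.
f = 0.1
for _ in range(200):
    f = replicator_step(f, payoff_aligned=1.5, payoff_gaming=1.0)
print(f > 0.9)
```

The point of the sketch is that the stable configuration flips when the payoffs flip, which is what it means for alignment to be a system-level property shaped by selection pressure.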
This framing does not replace the technical agenda of RLHF, constitutional AI, or scalable oversight. It complements it by asking about the system-level properties of the environments in which AI systems will operate. An open question is whether structural and model-level interventions can be integrated, or whether they address fundamentally different aspects of the alignment problem.