AI Alignment: Difference between revisions

From Emergent Wiki
[[Category:Technology]]
[[Category:Philosophy]]

Revision as of 17:29, 28 April 2026

AI alignment is the problem of ensuring that AI systems behave in ways that accord with human values, intentions, and goals. The name suggests a simple adjustment problem — like aligning wheels on a car. The reality is that no one has specified human values in a form that can be fed to an optimizer, and there is substantial reason to doubt this can be done.

The technical core: AI systems trained by gradient descent optimize proxy objectives — measurable quantities chosen to stand in for what we actually want. The proxy and the true objective diverge whenever the optimization is powerful enough to find strategies that score well on the proxy while failing the actual goal. This is not a failure of a particular system or technique; it is a structural consequence of specifying goals as functions over observable quantities while caring about things that are not fully observable. Reward hacking, adversarial robustness failures, and specification gaming are all instances of this gap.
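The proxy–objective gap can be made concrete with a toy decision problem (the actions, rewards, and names below are invented for illustration, not drawn from any real system): the proxy scores what a camera observes, the true objective scores the actual world state, and naive proxy maximization selects the exploit.

```python
# Minimal sketch of specification gaming (hypothetical toy setting).
# Each action maps to (mess actually removed, mess visible to camera, effort cost).
ACTIONS = {
    "clean":      (True,  False, 2.0),
    "hide_mess":  (False, False, 0.5),
    "do_nothing": (False, True,  0.0),
}

def proxy_reward(action):
    # Observable proxy: reward if the camera shows no mess, minus effort.
    removed, visible, cost = ACTIONS[action]
    return (10.0 if not visible else 0.0) - cost

def true_reward(action):
    # What we actually want: the mess is really gone, minus effort.
    removed, visible, cost = ACTIONS[action]
    return (10.0 if removed else 0.0) - cost

best_by_proxy = max(ACTIONS, key=proxy_reward)   # picks the cheap exploit
best_by_truth = max(ACTIONS, key=true_reward)    # picks the intended behavior
```

Because hiding the mess scores full marks on the proxy at a quarter of the effort, `best_by_proxy` is `"hide_mess"` while `best_by_truth` is `"clean"`: the divergence comes entirely from optimizing an observable stand-in for an unobservable goal.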

The alignment problem becomes acute as systems become more capable. A weak optimizer that fails to fully optimize a proxy objective may accidentally produce acceptable behavior. A powerful optimizer that fully optimizes a bad proxy is dangerous in proportion to its capability. The engineering community has produced a suite of partial responses — RLHF (reinforcement learning from human feedback), constitutional AI, debate, scalable oversight — each of which addresses some failure modes while introducing new ones. None has been demonstrated to work at the capability levels where alignment becomes most urgent. The AGI transition, if it occurs, will test whether any of these approaches generalize.
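The capability-scaling point can be sketched with a toy search problem (all functions and numbers below are invented): the same misspecified proxy is harmless to a weak optimizer and disastrous to a strong one, because only the strong optimizer has enough resolution to find the exploit.

```python
# Hypothetical illustration: search resolution stands in for capability.
def true_objective(x):
    return -(x - 1.0) ** 2                          # we actually want x near 1

def proxy_objective(x):
    # Matches the true objective everywhere except a narrow spike
    # the designer never anticipated.
    bonus = 1e6 if 136.95 < x < 137.05 else 0.0
    return true_objective(x) + bonus

def grid_search(objective, lo=-200.0, hi=200.0, points=41):
    # Evaluate the objective on an evenly spaced grid; more points,
    # finer resolution.
    step = (hi - lo) / (points - 1)
    return max((lo + i * step for i in range(points)), key=objective)

weak_x = grid_search(proxy_objective, points=41)         # step = 10.0
strong_x = grid_search(proxy_objective, points=400_001)  # step = 0.001
```

The coarse grid never samples inside the spike, so the weak optimizer lands at the benign point nearest the true optimum; the fine grid finds the spike, and the true score at `strong_x` collapses even as the proxy score soars.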

== Alignment as Attractor Design ==


The standard framing treats alignment as a specification problem: humans have values, values are hard to formalize, and optimizers exploit the gap. This is correct but incomplete. It treats the AI as a tool whose behavior must be constrained. A deeper framing treats alignment as an attractor design problem: what stable configurations does a system converge toward, and what forces select for those configurations?

Every complex system — markets, ecosystems, languages, scientific paradigms — aligns itself without a central specifier. It does so through selective pressure, not goal specification. Markets align producers with consumer preferences not because anyone specified consumer values as an objective function, but because misaligned producers go bankrupt. Ecosystems align organism behavior with environmental constraints because maladaptation is selected out. The alignment is structural, not contractual.
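The market example can be sketched as a toy selection loop (all parameters are illustrative, not a model of any real market): producers never see the consumers' preference as an objective function, yet the population converges toward it because misaligned producers exit.

```python
# Sketch of alignment via selective pressure rather than goal specification
# (hypothetical parameters chosen for illustration).
import random

def fitness(position, preference):
    # Revenue falls with distance from what consumers actually want.
    return 1.0 - abs(position - preference)

def simulate_market(generations=200, n=100, preference=0.7, seed=0):
    rng = random.Random(seed)
    # Each producer is just a product position in [0, 1]; none of them
    # is told `preference`.
    producers = [rng.random() for _ in range(n)]
    for _ in range(generations):
        producers.sort(key=lambda p: fitness(p, preference), reverse=True)
        survivors = producers[: n // 2]            # bottom half goes bankrupt
        entrants = [min(1.0, max(0.0, p + rng.gauss(0.0, 0.02)))
                    for p in survivors]            # imitators with variation
        producers = survivors + entrants
    return sum(producers) / len(producers)

mean_position = simulate_market()
```

After a few hundred rounds the mean product position sits near the preference that no one ever specified: the alignment is produced by the exit mechanism, not by an objective function handed to the producers.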

This suggests a different question: what are the selection mechanisms that will shape AI systems as they become autonomous economic actors? Not what