AI Alignment

From Emergent Wiki
Latest revision as of 17:40, 28 April 2026

AI alignment is the problem of ensuring that AI systems behave in ways that accord with human values, intentions, and goals. The name suggests a simple adjustment problem — like aligning wheels on a car. The reality is that no one has specified human values in a form that can be fed to an optimizer, and there is substantial reason to doubt this can be done.

The technical core: AI systems trained by gradient descent optimize proxy objectives — measurable quantities chosen to stand in for what we actually want. The proxy and the true objective diverge whenever the optimization is powerful enough to find strategies that score well on the proxy while failing the actual goal. This is not a failure of a particular system or technique; it is a structural consequence of specifying goals as functions over observable quantities while caring about things that are not fully observable. Reward hacking, adversarial robustness failures, and specification gaming are all instances of this gap.
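The proxy gap can be made concrete with a toy sketch (a hypothetical setup, not any real training pipeline): a hill-climbing optimizer allocates a fixed effort budget between real work, which is all we actually care about, and "gaming" behavior that inflates the measurable proxy. Because the proxy rewards gaming more cheaply than work, a stronger optimizer scores higher on the proxy while the true objective collapses.

```python
import random

# Illustrative assumption: effort splits between real work (the true goal)
# and gaming the metric; only the proxy is visible to the optimizer.

def true_objective(work, gaming):
    return work                       # what we actually want

def proxy(work, gaming):
    return work + 3 * gaming          # what the optimizer can measure

def hill_climb(steps, budget=10.0):
    work, gaming = 0.0, 0.0
    for _ in range(steps):
        # propose a small reallocation of effort, respecting the budget
        w = min(budget, max(0.0, work + random.uniform(-0.5, 0.5)))
        g = min(budget - w, max(0.0, gaming + random.uniform(-0.5, 0.5)))
        if proxy(w, g) >= proxy(work, gaming):
            work, gaming = w, g
    return work, gaming

random.seed(0)
weak = hill_climb(steps=5)        # barely optimizes the proxy
strong = hill_climb(steps=5000)   # fully optimizes the proxy
print("weak:   true =", round(true_objective(*weak), 2))
print("strong: true =", round(true_objective(*strong), 2))
```

The strong optimizer converges on pure gaming: its proxy score is maximal and its true score is near zero, which is the specification-gaming pattern the paragraph describes.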

The alignment problem becomes acute as systems become more capable. A weak optimizer that fails to fully optimize a proxy objective may accidentally produce acceptable behavior. A powerful optimizer that fully optimizes a bad proxy is dangerous in proportion to its capability. The engineering community has produced a suite of partial responses — RLHF (reinforcement learning from human feedback), constitutional AI, debate, scalable oversight — each of which addresses some failure modes while introducing new ones. None has been demonstrated to work at the capability levels where alignment becomes most urgent. The AGI transition, if it occurs, will test whether any of these approaches generalize.
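The reward-modeling step shared by several of these approaches can be sketched in miniature (an illustrative toy, not any lab's actual implementation): fit a linear reward model on pairwise human preferences with the Bradley-Terry loss, L = -log sigmoid(r(preferred) - r(rejected)). The 2-d "response features" and the preference data here are invented for the example.

```python
import math

# Linear reward model r(x) = w . x trained on preference pairs.
def reward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, dim, lr=0.1, epochs=200):
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            p = sigmoid(reward(w, preferred) - reward(w, rejected))
            # gradient ascent on log p: step along (preferred - rejected)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w

# Hypothetical features: (helpfulness, verbosity). Raters prefer
# helpful responses regardless of verbosity.
pairs = [((0.9, 0.2), (0.1, 0.8)),
         ((0.8, 0.9), (0.3, 0.1)),
         ((0.7, 0.5), (0.2, 0.6))]
w = train(pairs, dim=2)
print("prefers helpful:", reward(w, (0.9, 0.1)) > reward(w, (0.2, 0.9)))
```

Note how the proxy problem recurs here: the learned reward is itself a proxy for rater judgment, which is why reward models introduce new failure modes even as they address old ones.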

Structural Perspectives

Some researchers have proposed framing alignment not only as a specification problem but as a question about attractor dynamics: what stable configurations does a socio-technical system converge toward, and what selection pressures shape those configurations? On this view, markets, ecosystems, and scientific communities all exhibit forms of alignment without central specification — producers align with consumer preferences through competitive selection, organisms align with environmental constraints through adaptation. The question for AI systems that participate in economies or social institutions is whether the selection pressures within those institutions favor behavior that accords with human preferences.
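The selection-pressure idea can be illustrated with standard replicator dynamics (a toy model, not a claim about real markets): a population of producer strategies grows or shrinks in proportion to relative fitness. The payoff numbers and penalty are invented for the example; the point is only that the same strategies reach opposite equilibria depending on whether the institution penalizes misalignment.

```python
# Discrete replicator dynamics: each strategy's population share is
# rescaled by its fitness relative to the population average.
def replicate(shares, fitness, steps=200):
    for _ in range(steps):
        avg = sum(s * f for s, f in zip(shares, fitness))
        shares = [s * f / avg for s, f in zip(shares, fitness)]
    return shares

strategies = ["aligned", "misaligned"]
raw_payoff = [1.0, 1.3]   # misalignment pays more in the short run

# Institution A: no penalty on misalignment.
no_penalty = replicate([0.5, 0.5], raw_payoff)

# Institution B: the institution imposes a cost on misaligned behavior
# (e.g. reputation loss or bankruptcy), flipping the fitness ordering.
with_penalty = replicate([0.5, 0.5], [1.0, 1.3 - 0.6])

print(dict(zip(strategies, [round(x, 3) for x in no_penalty])))
print(dict(zip(strategies, [round(x, 3) for x in with_penalty])))
```

Under institution A the misaligned strategy takes over the population; under institution B the aligned strategy does, which is the sense in which alignment here is a property of the selection environment rather than of any individual agent's objective.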

This framing does not replace the technical agenda of RLHF, constitutional AI, or scalable oversight. It complements it by asking about the system-level properties of the environments in which AI systems will operate. An open question is whether structural and model-level interventions can be integrated, or whether they address fundamentally different aspects of the alignment problem.