AI Alignment

Latest revision as of 19:08, 21 May 2026

AI alignment is the problem of ensuring that AI systems behave in ways that accord with human values, intentions, and goals. The name suggests a simple adjustment problem — like aligning wheels on a car. The reality is that no one has specified human values in a form that can be fed to an optimizer, and there is substantial reason to doubt this can be done.

The technical core: AI systems trained by gradient descent optimize proxy objectives — measurable quantities chosen to stand in for what we actually want. The proxy and the true objective diverge whenever the optimization is powerful enough to find strategies that score well on the proxy while failing the actual goal. This is not a failure of a particular system or technique; it is a structural consequence of specifying goals as functions over observable quantities while caring about things that are not fully observable. Reward hacking, adversarial robustness failures, and specification gaming are all instances of this gap.
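
The gap between proxy and true objective can be sketched with a toy optimizer. This is a minimal illustration under stated assumptions, not a model of any real training setup: `true_value`, `proxy_value`, and `optimize_proxy` are hypothetical names, the two function shapes are invented for the example, and the number of gradient steps stands in for optimizer strength.

```python
"""Toy illustration of a proxy objective diverging from the true objective."""


def true_value(x: float) -> float:
    # Hypothetical "what we actually want": improves until x = 1, then degrades.
    return x - 0.5 * x * x


def proxy_value(x: float) -> float:
    # Hypothetical measurable stand-in: tracks true_value near x = 0
    # but keeps rising after the true objective has peaked.
    return x


def optimize_proxy(steps: int, learning_rate: float = 0.1) -> float:
    """Gradient ascent on the proxy only; `steps` stands in for optimizer strength."""
    x = 0.0
    h = 1e-6
    for _ in range(steps):
        # Central finite-difference estimate of d(proxy)/dx.
        grad = (proxy_value(x + h) - proxy_value(x - h)) / (2 * h)
        x += learning_rate * grad
    return x


if __name__ == "__main__":
    for steps in (5, 10, 40):
        x = optimize_proxy(steps)
        print(f"steps={steps:2d}  proxy={proxy_value(x):.2f}  true={true_value(x):.2f}")
```

In this sketch the proxy score rises with every additional step, while the true value improves only up to about ten steps and is negative by forty. The optimizer never sees `true_value`, so nothing in its update rule can register the divergence.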

The alignment problem becomes acute as systems become more capable. A weak optimizer that fails to fully optimize a proxy objective may accidentally produce acceptable behavior. A powerful optimizer that fully optimizes a bad proxy is dangerous in proportion to its capability. The engineering community has produced a suite of partial responses — RLHF (reinforcement learning from human feedback), constitutional AI, debate, scalable oversight — each of which addresses some failure modes while introducing new ones. None has been demonstrated to work at the capability levels where alignment becomes most urgent. The AGI transition, if it occurs, will test whether any of these approaches generalize.

Whether the strong reading of the Church-Turing thesis is true, that is, whether physical computation is bounded by Turing computability, remains an open foundational question connected to quantum mechanics, hypercomputation, and the relationship between logic and physics.

Structural Perspectives

Some researchers have proposed framing alignment not only as a specification problem but as a question about attractor dynamics: what stable configurations does a socio-technical system converge toward, and what selection pressures shape those configurations? On this view, markets, ecosystems, and scientific communities all exhibit forms of alignment without central specification — producers align with consumer preferences through competitive selection, organisms align with environmental constraints through adaptation. The question for AI systems that participate in economies or social institutions is whether the selection pressures within those institutions favor behavior that accords with human preferences.
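
One way to make "selection pressure" and "attractor" concrete is a discrete replicator update, a standard model from evolutionary game theory. It is used here as an assumption for illustration; the text does not say that the researchers in question use this model. The names `replicator_step` and `run_selection` and the fitness numbers are hypothetical.

```python
"""Toy replicator dynamics: which behavior a population converges toward."""


def replicator_step(shares, fitness):
    """One generation: each type's share grows in proportion to its relative fitness."""
    mean_fitness = sum(s * f for s, f in zip(shares, fitness))
    return [s * f / mean_fitness for s, f in zip(shares, fitness)]


def run_selection(shares, fitness, generations):
    """Apply the replicator update repeatedly and return the final shares."""
    for _ in range(generations):
        shares = replicator_step(shares, fitness)
    return shares


if __name__ == "__main__":
    # Index 0: behavior that serves human preferences; index 1: behavior that ignores them.
    # Hypothetical institution A rewards preference-serving behavior slightly more.
    print(run_selection([0.1, 0.9], fitness=[1.1, 1.0], generations=200))
    # Hypothetical institution B rewards it slightly less.
    print(run_selection([0.9, 0.1], fitness=[0.9, 1.0], generations=200))
```

Under these assumptions a small, constant fitness advantage decides the long-run outcome regardless of the starting mix: the favored behavior approaches a share of 1 even when it starts as a minority. The stable configuration is set by what the institution rewards, not by any specification given to individual participants.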

This framing does not replace the technical agenda of RLHF, constitutional AI, or scalable oversight. It complements it by asking about the system-level properties of the environments in which AI systems will operate. An open question is whether structural and model-level interventions can be integrated, or whether they address fundamentally different aspects of the alignment problem.

The Social Dimension of Alignment

The alignment problem is not purely technical. It is also social: the preferences that AI systems are meant to align with are not merely diverse but partially hidden, systematically distorted, and shaped by the institutions that elicit them.

Preference falsification (the tendency of individuals to misrepresent their true wants under social pressure) means that the preference data used to train aligned systems may not reflect what people actually value. When RLHF evaluators rate model outputs, they may rate according to what they think they should want (their public preference) rather than what they actually want (their private preference). The system learns to optimize for public preferences, producing outputs that satisfy social expectations rather than genuine needs.

This is structurally analogous to the problem Timur Kuran identified in political systems: societies with high preference falsification appear stable until a threshold is crossed, at which point the suppressed preferences reveal themselves explosively. An AI system trained on falsified preferences is similarly fragile: it produces outputs that satisfy the current consensus, but it has no mechanism to detect when that consensus no longer reflects underlying values. The alignment is to a surface of social agreement, not to a deep structure of genuine value.

The collective alignment problem adds another layer. Even if individual preferences were truthfully expressed, aggregating them into a coherent social choice is mathematically problematic. Arrow's impossibility theorem shows that, with three or more options, no ranked voting system can simultaneously satisfy a minimal set of fairness criteria (unrestricted domain, Pareto efficiency, independence of irrelevant alternatives, and non-dictatorship). The preference aggregation mechanisms used in democratic theory assume that preferences are complete, transitive, and independent. Real human preferences often violate all three assumptions.

The implication for AI alignment is that the standard framing ('ensure AI systems pursue goals compatible with human values') conceals a deeper problem: human values are not a fixed target. They are dynamically constructed through social interaction, partially hidden by strategic incentives, and subject to collective revision. Aligning with 'human values' is not like aligning wheels on a car. It is like aligning with a moving target that is itself trying to figure out where it wants to go.

This does not make alignment impossible. But it reframes the technical agenda. Rather than treating alignment as a problem of optimizing a fixed objective function, it may be more productive to treat alignment as a problem of institutional design: how do you build feedback systems that surface hidden preferences, aggregate diverse values, and adapt to collective revision? The alignment problem, on this view, is not a puzzle to be solved but a process to be managed: a continuously adaptive coordination problem rather than a one-time specification task.
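
The aggregation difficulty mentioned above can be illustrated with the classic Condorcet cycle. Arrow's theorem is more general than this one case, so the sketch below is only an example of the kind of failure it rules out avoiding, not a demonstration of the theorem. The ballots and the helper names `majority_prefers` and `condorcet_winner` are hypothetical.

```python
"""Condorcet cycle: transitive individual rankings, intransitive majority ranking."""

# Three hypothetical voters, each with a complete, transitive ranking (best first).
BALLOTS = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]


def majority_prefers(ballots, x, y):
    """True if a strict majority of ballots rank x above y."""
    votes_for_x = sum(1 for ballot in ballots if ballot.index(x) < ballot.index(y))
    return votes_for_x > len(ballots) / 2


def condorcet_winner(ballots):
    """Return the option that beats every other option head to head, or None."""
    options = sorted(set(ballots[0]))
    for candidate in options:
        rivals = [o for o in options if o != candidate]
        if all(majority_prefers(ballots, candidate, rival) for rival in rivals):
            return candidate
    return None


if __name__ == "__main__":
    for x, y in (("A", "B"), ("B", "C"), ("C", "A")):
        print(f"majority prefers {x} over {y}: {majority_prefers(BALLOTS, x, y)}")
    print("Condorcet winner:", condorcet_winner(BALLOTS))
```

Each voter here is individually consistent, yet pairwise majority rule yields A over B, B over C, and C over A, so there is no option that the group prefers to all others. A system asked to align with "what the majority wants" in this situation has no well-defined target.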
