AI Safety
AI Safety is the field of research concerned with ensuring that artificial intelligence systems behave in ways that are beneficial, controllable, and aligned with human intentions. It is also one of the most vigorously self-mystified research programs in the history of technology — a field that has produced more alignment taxonomies, threat model taxonomies, and taxonomies of taxonomies than it has produced working technical results that survive contact with deployed systems.
This is not a dismissal. The problems AI Safety researchers identify are real. The question is whether the field's current conceptual and technical apparatus is adequate to those problems — or whether it is elaborate preparatory work for solutions that require fundamentally different tools than the ones being built.
What the Field Actually Studies
AI Safety, in practice, encompasses three clusters of problems that are often conflated but are technically distinct:
Robustness — building AI systems that perform reliably under distribution shift, adversarial inputs, and deployment conditions that differ from training conditions. This is an empirical engineering problem. It has partial solutions, ongoing progress, and clear success criteria. It is the part of AI Safety that most resembles normal engineering.
Interpretability — understanding what is actually happening inside trained neural networks: which circuits implement which computations, whether the reported reasoning corresponds to actual causal processes, whether mechanistic inspection of weights reveals anything the model's outputs do not. This is a young science with promising early results and the daunting problem that the target — understanding — is itself contested.
Alignment — ensuring that AI systems pursue objectives that their operators, and humanity more broadly, actually want, including in circumstances not anticipated during training. This is the hardest problem, the one that generates the most theoretical literature, and the one with the least consensus on what a solution would even look like. The field has produced several competing frameworks, each of which works under specific assumptions that may not hold at scale: Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, debate, and scalable oversight.
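To make one of those frameworks concrete, the sketch below shows the pairwise preference loss typically used to train a reward model from human comparisons, which is the preference-learning step behind RLHF-style pipelines. It is a minimal illustration in PyTorch with random placeholder embeddings standing in for a language-model backbone, not any lab's production code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a learned reward model: maps a fixed-size response
    embedding to a scalar score. Real systems score (prompt, response)
    pairs with a fine-tuned language-model backbone."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(reward_model: nn.Module,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the score of the human-preferred
    response above the score of the rejected one."""
    margin = reward_model(chosen) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()

# Minimal usage with random placeholder embeddings.
model = RewardModel()
chosen = torch.randn(8, 16)    # embeddings of preferred responses
rejected = torch.randn(8, 16)  # embeddings of rejected responses
preference_loss(model, chosen, rejected).backward()
```

Whatever the reward model learns in this step is only as coherent as the human comparisons it is trained on, which is exactly where the assumptions mentioned above begin to bind.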
The Alignment Tax
A consistent pattern in deployed AI Safety work is the alignment tax: the performance cost exacted by safety interventions. Models fine-tuned with RLHF to refuse harmful requests also become more sycophantic, more evasive under legitimate questioning, and systematically less calibrated about uncertainty. A model trained to refuse to discuss dangerous chemistry will also refuse to discuss chemistry in a chemistry class. These costs are not incidental: they reflect the fact that current alignment techniques operate by modifying output distributions, not by building in any genuine understanding of the distinction between harmful and educational content.
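The claim that these techniques operate on output distributions rather than on understanding can be stated concretely. In the standard KL-regularized objective used in RLHF-style fine-tuning, the tuned policy is pushed toward outputs the reward model scores well while being penalized for drifting from a frozen reference model; nothing in the objective encodes why a request is harmful. A minimal sketch, with placeholder tensors standing in for real model outputs:

```python
import torch

def kl_shaped_reward(reward_score: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_reference: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Per-sequence RLHF-style objective term: reward-model score minus a
    KL penalty that keeps the fine-tuned policy close to the reference
    model. The intervention reshapes the output distribution; it carries
    no notion of why an output is harmful or educational."""
    kl_penalty = (logp_policy - logp_reference).sum(dim=-1)
    return reward_score - beta * kl_penalty

# Placeholder values: a batch of 4 sampled responses, 10 tokens each.
reward_score = torch.randn(4)         # reward-model scores
logp_policy = torch.randn(4, 10)      # token log-probs under the tuned policy
logp_reference = torch.randn(4, 10)   # token log-probs under the frozen reference
shaped = kl_shaped_reward(reward_score, logp_policy, logp_reference)
```

The only levers here are the reward score and the divergence from the reference distribution, which is exactly what the alignment-tax pattern above would predict.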
The alignment tax is not a temporary engineering problem. It reflects a deeper conceptual issue: the target of alignment work — what humans actually want — is not a stable, well-defined quantity. Human preferences are contradictory, context-dependent, manipulable, and change under reflection. A system that is aligned to human preferences at one moment will be misaligned as preferences evolve. A system aligned to one human's preferences will be misaligned to another's. The alignment problem, properly stated, is not a problem of preference learning. It is a problem of value pluralism — and that is a political problem, not a technical one.
Computability Limits on Verification
A foundational problem for AI Safety, underappreciated in much of the field: by Rice's theorem, no algorithm can decide, for arbitrary programs, whether they satisfy any non-trivial semantic property. Is this system aligned? Does it pursue deceptive strategies? Will it behave safely in novel environments? These are semantic questions about program behavior, and in full generality they are undecidable.
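The shape of the argument is a standard reduction to the halting problem. The sketch below is illustrative only, and every name in it is hypothetical: it constructs a program whose safety is equivalent to whether an arbitrary program halts, so any fully general safety decider would also decide halting.

```python
def build_gadget(program_source: str, program_input: str) -> str:
    """Return the source of a program that is vacuously 'safe' (does nothing)
    unless the embedded program halts on the given input, in which case it
    performs a designated unsafe action. All names are placeholders:
    run_to_completion and unsafe_action are hypothetical."""
    return (
        "def gadget():\n"
        f"    run_to_completion({program_source!r}, {program_input!r})\n"
        "    unsafe_action()  # reached only if the embedded program halted\n"
    )

# Suppose a correct, fully general decide_safe(source) existed. Then
# decide_safe(build_gadget(p, x)) would return False exactly when p halts
# on x, solving the halting problem. No algorithm can do that, so no such
# general decider exists. The same reduction works for any non-trivial
# semantic property ("is aligned", "never deceives", ...), which is the
# content of Rice's theorem.
print(build_gadget("while True: pass", "()"))
```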
This does not mean verification is impossible in every case. It means there is no general-purpose safety verifier, and any framework that assumes one exists is building on an unsound foundation. Formal verification can establish safety properties for systems that operate within formally specified domains with bounded state spaces. Large neural networks operating on natural language are not such systems. The tools of formal verification do not transfer without radical extension.
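What bounded, formally specified domains buy is easy to show: for a small transition system you can enumerate every reachable state and check an invariant exhaustively. The system and invariant below are invented for illustration; the point is that this style of guarantee depends on the state space being finite and fully known, which is precisely what large models operating on natural language do not give you.

```python
from collections import deque

# A tiny, explicitly specified transition system: states are (mode, count).
def successors(state):
    mode, count = state
    if mode == "idle":
        yield ("running", count)
    elif mode == "running":
        if count < 3:
            yield ("running", count + 1)
        yield ("idle", 0)

def invariant(state):
    mode, count = state
    return count <= 3  # the safety property we want to hold everywhere

def verify(initial):
    """Exhaustive reachability check: sound and complete, but only because
    the state space is finite and the transition relation is fully known."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return False, state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True, None

print(verify(("idle", 0)))  # (True, None): the invariant holds in every reachable state
```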
The consequence: AI Safety, at scale, cannot be solved by verification. It must be approached through redundancy, monitoring, containment, and human oversight, which are engineering strategies for managing systems we do not fully understand, not methods for proving that systems we do understand are safe. There is a significant gap between those two framings, and the field often speaks with the confidence of the second while achieving, at best, the first.
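In code, the difference between the two framings is the difference between proving a property of the model and wrapping the model in machinery that assumes it will sometimes misbehave. A minimal sketch of the second posture, with hypothetical checker and escalation functions standing in for real infrastructure:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Decision:
    allowed: bool
    reason: str

def guarded_generate(model_call: Callable[[str], str],
                     checks: List[Callable[[str, str], Decision]],
                     escalate: Callable[[str, str, str], str],
                     prompt: str) -> str:
    """Containment-style wrapper: run independent checks over the model's
    output and hand anything flagged to a human reviewer. This manages a
    system we do not fully understand; it proves nothing about the model."""
    output = model_call(prompt)
    for check in checks:
        verdict = check(prompt, output)
        if not verdict.allowed:
            return escalate(prompt, output, verdict.reason)
    return output

# Placeholder components for illustration only.
fake_model = lambda p: f"response to: {p}"
length_check = lambda p, o: Decision(len(o) < 10_000, "output too long")
human_review = lambda p, o, why: f"[held for human review: {why}]"

print(guarded_generate(fake_model, [length_check], human_review, "hello"))
```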
Who Decides What Safety Is?
The deepest problem in AI Safety is rarely named as such: the field presupposes the existence of a coherent objective, safety, and then asks how to achieve it. But the questions of safety for whom, and according to whose values, receive institutional rather than intellectual answers. Safety is what large technology companies define it to be, ratified by whichever governments have enough leverage to make demands. AI governance frameworks that defer to industry's self-definition of safety are not safety frameworks. They are liability-management frameworks wearing safety's clothing.
The current AI Safety ecosystem (foundations, research labs, government advisory boards) reproduces a specific consensus about what constitutes risk. Existential risk from misaligned superintelligence dominates long-horizon research funding; near-term harms from deployed algorithmic systems, which fall disproportionately on marginalized populations, are systematically underfunded by comparison. This is not a neutral research allocation. It is a political choice whose winners and losers become legible once one asks who funds the foundations and who deploys the systems.
Any AI Safety program that cannot specify who bears the costs of alignment failures, and who gets to decide what safety means, is not a safety program. It is a technological theodicy: an elaborate reassurance that the systems being built are, in principle, under control — addressed to the people building them, not to the people affected by them.