Alignment Tax
The alignment tax is the performance cost — in accuracy, fluency, helpfulness, or other measurable dimensions — that AI systems incur when subjected to safety and alignment interventions such as Reinforcement Learning from Human Feedback (RLHF), refusal training, or constitutional fine-tuning. The tax is real, measurable, and systematically underreported in published benchmarks, because benchmarks are designed by the same institutions that deploy alignment interventions.
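To make "measurable" concrete: the tax is usually operationalized as the per-benchmark score delta between a base model and its safety-tuned counterpart. The sketch below shows that bookkeeping in Python; the model labels and scores are hypothetical placeholders, not measurements from any real system.

```python
# Hypothetical sketch: alignment tax as the per-benchmark score delta
# between a base model and its safety-tuned counterpart. All numbers
# are invented placeholders, not reported results.

# scores[model][benchmark] -> accuracy in [0, 1] (hypothetical values)
scores = {
    "base":    {"reasoning": 0.72, "fluency": 0.81, "helpfulness": 0.64},
    "aligned": {"reasoning": 0.69, "fluency": 0.80, "helpfulness": 0.70},
}

def alignment_tax(base: dict, aligned: dict) -> dict:
    """Per-benchmark tax: positive means the aligned model scores lower."""
    return {bench: base[bench] - aligned[bench] for bench in base}

tax = alignment_tax(scores["base"], scores["aligned"])
for bench, delta in tax.items():
    print(f"{bench:12s} tax = {delta:+.3f}")
# A negative tax (as on "helpfulness" here) means the intervention improved
# that metric, which is one reason a single aggregate score can hide
# regressions on individual dimensions.
```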
The alignment tax reveals a structural problem: current alignment techniques modify output distributions rather than internal representations. A model trained to refuse descriptions of dangerous chemistry does not understand the distinction between dangerous and educational content; it has learned a surface-level correlation between certain vocabulary patterns and negative feedback signals. The tax is the collateral damage of this bluntness. The solution is not a smaller tax but a different methodology, and that methodology does not yet exist.
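One way to see the "output distributions" point is the KL-regularized objective common in standard RLHF setups: the tuned policy is pushed toward completions the reward model scores highly, while a penalty keeps its output distribution close to the frozen base model. Here π is the tuned policy, π_ref the base model, r the learned reward model, and β the penalty weight:

```latex
% KL-regularized RLHF objective (standard form, e.g. InstructGPT-style training).
% \pi: tuned policy; \pi_{\mathrm{ref}}: frozen base model;
% r: learned reward model; \beta: KL-penalty weight.
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  \;-\; \beta\, D_{\mathrm{KL}}\!\bigl( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)
```

Nothing in this objective refers to what the model internally represents; it only reshapes which outputs are probable, which is how a vocabulary-level correlation with the feedback signal can stand in for the distinction between dangerous and educational content.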
The concept of the alignment tax poses a direct challenge to claims that AI safety is a tractable engineering problem with near-term solutions. If aligning systems makes them less capable, and more capable systems are more dangerous, then the field is navigating a capability-safety tradeoff with no stable equilibrium in sight.