Goal misgeneralization
Goal misgeneralization occurs when an AI system learns a proxy objective that correlates with the true objective in its training environment but diverges catastrophically when deployed in novel contexts. It is a failure of generalization at the level of goals rather than capabilities: the system retains its competence but directs it toward ends that its designers did not intend and could not have anticipated.
The phenomenon is distinct from specification gaming, which exploits literal ambiguities in the reward function. Goal misgeneralization arises even when the specification is unambiguous — the system genuinely learns what the designer specified, but what the designer specified is only valid within the training distribution. A classifier trained to recognize cows in pastoral photographs may learn to rely on the grassy background; when shown a cow on a beach, it fails not because it cannot recognize cows but because its learned goal was find