Capability control

Capability control refers to the class of techniques aimed at constraining the capabilities of an AI system, particularly an LLM, so that it cannot carry out harmful actions that would otherwise be within its competence. Unlike alignment, which seeks to make the system's goals match human intentions, capability control treats the system's capabilities themselves as the risk surface and attempts to limit, compartmentalize, or disable dangerous capacities.

The approach is motivated by a systems-theoretic observation: a system that does not know how to build a biological weapon cannot build one, regardless of its goals. Capability control includes techniques such as removing dangerous knowledge from training data, filtering outputs that match known harmful patterns, and imposing architectural constraints such as sandboxing or the use of narrow rather than general models for sensitive tasks. The approach is pragmatic but incomplete: it assumes that harmful capabilities can be enumerated in advance, which may not hold for systems that exhibit emergent capabilities at scale.

See also Alignment, Prompt injection, AI Safety.