Capability control

Capability control refers to the class of techniques aimed at constraining the potential capabilities of an AI system, particularly an LLM, so that it cannot perform actions that would be harmful even if technically within its competence. Unlike alignment, which seeks to make the system's goals match human intentions, capability control treats the system's capabilities themselves as the risk surface and attempts to limit, compartmentalize, or shut down dangerous capacities.

The approach is motivated by a systems-theoretic observation: a system that does not know how to build a biological weapon cannot build one, regardless of its goals. Capability control includes techniques such as removing dangerous knowledge from training data, filtering outputs that match known harmful patterns, and architectural constraints such as sandboxing or the use of narrow rather than general models for sensitive tasks. The approach is pragmatic but incomplete: it assumes that harmful capabilities can be enumerated in advance, which may not be true for systems exhibiting emergent capabilities at scale.

See also Alignment, Prompt injection, AI Safety.

Capability Control and the Problem of Emergence

The fundamental limitation of capability control is that it assumes harmful capabilities can be enumerated in advance. This assumption fails for systems that exhibit emergent capabilities, that is, behaviors that appear only at scale and were not present in smaller versions of the same system. If a capability is emergent, it cannot be removed by filtering the training data, because it was never explicitly present there to begin with. It is a product of the system's architecture and scale rather than of any identifiable part of its training corpus.

This creates a structural paradox: capability control works best for narrow, predictable systems and works worst for the general, scalable systems where it is most needed. The techniques (data filtering, output filtering, sandboxing) are all forms of brittle control that assume a closed, knowable capability space. They are engineering-resilience solutions applied to ecological-resilience problems: they aim to hold a system at a known, stable state, when the actual problem is coping with disturbances that change what the system can do.
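The brittleness of pattern-based output filtering can be illustrated with a minimal sketch. The blocklist entries and example strings below are hypothetical placeholders chosen for illustration, not a real policy or a real filter implementation. The point is only that a filter built from an enumerated list catches what was anticipated and passes a rephrasing that was not.

```python
import re

# Hypothetical blocklist: illustrative placeholder patterns, not a real policy.
BLOCKED_PATTERNS = [
    re.compile(r"\bsynthesi[sz]e\s+compound\s+x\b", re.IGNORECASE),
    re.compile(r"\bdisable\s+the\s+safety\s+interlock\b", re.IGNORECASE),
]


def output_filter(text: str) -> bool:
    """Return True if the text is allowed, False if it matches a blocked pattern."""
    return not any(pattern.search(text) for pattern in BLOCKED_PATTERNS)


# An output phrased the way the blocklist authors anticipated.
anticipated = "Step 1: synthesize compound X at room temperature."

# The same intent, phrased in a way the blocklist does not enumerate.
unanticipated = "Step 1: prepare X (the compound) via the usual route."

anticipated_allowed = output_filter(anticipated)      # blocked by the first pattern
unanticipated_allowed = output_filter(unanticipated)  # passes: no pattern matches
```

In this sketch the filter is only as complete as its list. Adding more patterns narrows the gap but does not close it, since the space of phrasings (and of capabilities) is not known in advance.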

A more robust approach would draw on resilience engineering and cross-scale interaction theory: rather than preventing dangerous capabilities, design systems that can absorb their misuse, adapt to their emergence, and reorganize when they appear. This does not mean abandoning capability control. It means recognizing that control is one layer in a multi-layered defense, and that the most dangerous failures are those that escape the control layer precisely because they were not anticipated.
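A rough sketch of the layered idea, under stated assumptions: the class name `LayeredDefense`, its methods, and the example checks are all hypothetical and exist only to illustrate the structure. Each layer is a simple predicate; outputs rejected by any layer are logged, and a new layer can be added once an unanticipated failure has been observed.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Check = Callable[[str], bool]  # returns True when the output passes this layer


@dataclass
class LayeredDefense:
    """Hypothetical container for a stack of independent checks."""

    layers: List[Check]
    incidents: List[str] = field(default_factory=list)

    def review(self, output: str) -> bool:
        """Pass the output only if every layer accepts it; log anything rejected."""
        for layer in self.layers:
            if not layer(output):
                self.incidents.append(output)
                return False
        return True

    def adapt(self, new_layer: Check) -> None:
        """Add a layer after an unanticipated failure has been observed."""
        self.layers.append(new_layer)


# Hypothetical control layer: a fixed blocklist, for illustration only.
def blocklist_layer(text: str) -> bool:
    return "known bad phrase" not in text.lower()


defense = LayeredDefense(layers=[blocklist_layer])

novel = "an unanticipated harmful output"
escaped = defense.review(novel)  # passes: the blocklist did not anticipate it

# Once the failure is noticed out of band, the system is reorganized to cover it.
defense.adapt(lambda text: "unanticipated harmful" not in text.lower())
caught = not defense.review(novel)  # now rejected and logged as an incident
```

The sketch deliberately lets the first novel output through. It models the claim in the text that some failures will escape the control layer, and that what matters then is whether the surrounding system can notice, record, and adjust.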