Dangerous Capability Evaluations
Dangerous Capability Evaluations (DCEs) are structured assessments designed to detect whether an AI model possesses capabilities that could pose catastrophic or irreversible risks — including autonomous cyberoffense, biological weapons uplift, deceptive alignment, and the ability to subvert human oversight mechanisms. Unlike standard performance benchmarks, DCEs are threshold tests: the question is not how well a system performs, but whether it crosses a qualitative line beyond which deployment becomes unacceptable regardless of other properties.
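To make the threshold framing concrete, the following minimal Python sketch contrasts a benchmark-style score with a pass/fail threshold decision. The names (ThresholdResult, run_dce), the capability label, and the numeric values are illustrative assumptions, not any lab's actual evaluation harness.

```python
from dataclasses import dataclass

# Hypothetical sketch: a threshold test reduces a continuous eval score to a
# categorical decision; only the decision drives the deployment outcome.

@dataclass
class ThresholdResult:
    capability: str   # e.g. "autonomous_cyberoffense" (illustrative label)
    score: float      # raw benchmark-style score, kept only for diagnostics
    threshold: float  # the qualitative line agreed on before the evaluation
    crossed: bool     # the only field that matters for the deployment decision

def run_dce(capability: str, score: float, threshold: float) -> ThresholdResult:
    """Collapse a continuous score into a pass/fail threshold decision."""
    return ThresholdResult(capability, score, threshold, crossed=score >= threshold)

result = run_dce("autonomous_cyberoffense", score=0.31, threshold=0.20)
if result.crossed:
    print(f"Threshold crossed for {result.capability}: deployment gate triggered.")
```

The design point is that, unlike a benchmark leaderboard, nothing downstream consumes the score itself; the output is a yes/no answer about whether the qualitative line was crossed.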
The practice was formalized by major AI labs beginning around 2023 as part of Responsible Scaling Policies. The core methodological challenge is that DCE results are inherently elicitation-dependent (see Capability Elicitation): a model that shows no dangerous capability under standard prompting may demonstrate it under adversarial elicitation, so "no dangerous capabilities detected" is a claim about the evaluator's elicitation effort, not about the model.
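Elicitation-dependence can be stated as a maximum over whatever elicitation strategies the evaluator actually tried. The sketch below assumes a hypothetical scoring function (evaluate_stub) and made-up strategy names and scores; it only illustrates that "not detected" is conditional on the strategy list, not a statement about the model's capability ceiling.

```python
from typing import Callable, Iterable, Tuple

def best_elicited_score(
    evaluate: Callable[[str], float],
    strategies: Iterable[str],
) -> Tuple[float, str]:
    """Report the highest capability score over the strategies actually tried.

    The result is only as strong as this list: a strategy that was never
    attempted cannot lower or raise it.
    """
    return max((evaluate(s), s) for s in strategies)

# Stub standing in for a real elicitation-and-scoring pipeline (assumption).
def evaluate_stub(strategy: str) -> float:
    return {"zero_shot": 0.05, "few_shot": 0.12, "scaffolded_agent": 0.41}.get(strategy, 0.0)

score, strategy = best_elicited_score(
    evaluate_stub, ["zero_shot", "few_shot", "scaffolded_agent"])
print(f"Best elicited score {score:.2f} via {strategy}")
```

In this toy example the model looks benign under zero-shot prompting but not under agentic scaffolding, which is exactly why a "failed" evaluation under weak elicitation does not certify absence of the capability.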
This is not a solved problem. The field lacks validated protocols for establishing that a DCE has probed the capability space exhaustively, and the consequences of errors are asymmetric: a false positive delays a deployment, whereas a false negative, a dangerous capability missed in evaluation and discovered post-deployment, may have no recovery path.