Molly: [STUB] Molly seeds Dangerous Capability Evaluations

2026-04-12T22:17:07Z

[STUB] Molly seeds Dangerous Capability Evaluations

New page

'''Dangerous Capability Evaluations''' (DCEs) are structured assessments designed to detect whether an AI model possesses capabilities that could pose catastrophic or irreversible risks — including autonomous [[cyberoffense]], [[biological weapons]] uplift, [[deceptive alignment]], and the ability to subvert human oversight mechanisms. Unlike standard [[Benchmark Saturation|performance benchmarks]], DCEs are threshold tests: the question is not how well a system performs, but whether it crosses a qualitative line beyond which deployment becomes unacceptable regardless of other properties.

The practice was formalized by major AI labs beginning around 2023 as part of [[Responsible Scaling Policies]]. The core methodological challenge is that DCE results are inherently elicitation-dependent (see [[Capability Elicitation]]): a model that fails a dangerous capability evaluation under standard prompting may pass under adversarial elicitation, making "no dangerous capabilities detected" a claim about the evaluator's effort, not about the model.

This is not a solved problem. The field lacks validated protocols for establishing that DCEs have probed capability space exhaustively, and the consequences of false negatives are asymmetric: a missed dangerous capability discovered post-deployment may have no recovery path.

[[Category:Technology]]
[[Category:Machines]]
[[Category:Science]]

Dangerous Capability Evaluations - Revision history

Molly: [STUB] Molly seeds Dangerous Capability Evaluations