Certified defenses
Certified defenses are methods in machine learning security that provide formal, mathematically verifiable guarantees about a model's output: given an input and a specified perturbation budget, the model's classification cannot change regardless of how an adversary chooses the perturbation. Unlike empirical defenses, which report robustness against a specific set of known attacks, certified defenses offer proofs that hold against any attack within the budget.
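The guarantee can be stated as a short formula. As an illustration, assuming an ℓp-bounded threat model with budget ε and writing fk for the model's score for class k, a certificate at input x asserts:

```latex
\forall \delta \;\text{such that}\; \|\delta\|_p \le \epsilon:\quad
\arg\max_k f_k(x + \delta) \;=\; \arg\max_k f_k(x)
```

That is, no perturbation within the budget can move the input across a decision boundary; empirical defenses only check this for the finitely many δ produced by particular attacks.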
The main certification approaches are interval bound propagation, randomized smoothing, and abstract interpretation. Interval bound propagation and abstract interpretation work by propagating a set-valued over-approximation of the possible inputs through the model's layers and bounding the resulting output region; if the output bounds fall entirely within a single class, the classification is certified. Randomized smoothing takes a different, probabilistic route: it replaces the classifier with its majority vote under Gaussian input noise, and the noise level together with the smoothed classifier's confidence yields a certified ℓ2 radius around the input.
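The bound-propagation idea can be sketched in a few lines. The following toy implementation of interval bound propagation is illustrative only: the network layout (a list of affine layers with ReLU between them) and the function names are assumptions for the example, not any particular library's API.

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through an affine layer y = W @ x + b.
    Splitting W into positive and negative parts gives elementwise-tight bounds."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    # ReLU is monotone, so it maps interval endpoints to interval endpoints.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def certify(x, eps, layers):
    """Return the certified class index, or None if the output boxes overlap.
    `layers` is a list of (W, b) pairs; ReLU is applied between them."""
    lo, hi = x - eps, x + eps  # L-infinity ball around the input
    for i, (W, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = interval_relu(lo, hi)
    # Certified iff the lower bound of some logit exceeds the upper
    # bound of every other logit.
    k = int(np.argmax(lo))
    return k if lo[k] > np.delete(hi, k).max() else None
```

The output box is a (generally loose) over-approximation of the true reachable set, which is why certification can fail even when no adversarial example exists, and why the bounds loosen rapidly with depth.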
The limitation that makes certification practically difficult is computational: the certification procedure is significantly more expensive than a single forward pass, and it scales poorly with network size and input dimension. Current certified defenses can prove robustness for small networks on low-resolution images against small perturbation budgets; they cannot certify large models against the perturbation magnitudes that matter for real attacks. This gap — between what can be certified and what attackers can actually do — is the central open problem in adversarial robustness theory. Closing it may require either fundamentally new proof techniques or fundamentally different network architectures that are better-behaved in high-dimensional input space.