Adversarial Robustness

Adversarial robustness is the property of a machine learning system of maintaining correct output when its inputs are perturbed by small, intentionally crafted modifications. The perturbed inputs, called adversarial examples, are designed to cause misclassification or other erroneous behavior. The existence of adversarial examples reveals a fundamental mismatch between how neural networks represent decision boundaries and how humans conceptualize similarity. Two images that are perceptually indistinguishable to humans can be classified into entirely different categories by a network, because the network's representation space contains high-dimensional structures invisible to human perception.
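
As a minimal sketch of how a small per-coordinate change can flip a prediction, consider a linear classifier with score w.x + b in place of a neural network. The linear model, the random weights, and the name `linf_attack_linear` are assumptions made for illustration and are not part of any particular library or published attack implementation.

```python
import numpy as np

def linf_attack_linear(x, w, b, eps):
    """Shift every coordinate of x by exactly eps, in the direction that
    pushes the linear score w.x + b toward the opposite sign."""
    score = w @ x + b
    # The gradient of the score with respect to x is w, so stepping
    # against sign(score) * sign(w) lowers |score| as fast as possible
    # for a given L-infinity budget eps.
    return x - eps * np.sign(score) * np.sign(w)

# Hypothetical high-dimensional input and classifier (illustrative only).
rng = np.random.default_rng(0)
d = 10_000
w = rng.normal(size=d)
x = rng.normal(size=d)
b = 0.0

score = w @ x + b
# Smallest per-coordinate budget that reaches the decision boundary.
eps_min = abs(score) / np.abs(w).sum()
# Doubling it carries the input across the boundary.
eps = 2.0 * eps_min

x_adv = linf_attack_linear(x, w, b, eps)
adv_score = w @ x_adv + b

print(f"per-coordinate change: {eps:.4f}")
print(f"clean score: {score:+.2f}, perturbed score: {adv_score:+.2f}")
```

In this toy setting the total shift in the score is eps times the L1 norm of w, which grows with the number of input dimensions. That is one commonly offered explanation for why a change that is tiny in every coordinate can still move an input across a decision boundary.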

Adversarial robustness is not merely a security concern. It is also a diagnostic tool for understanding the geometry of learned representations. A network that is not adversarially robust has learned an unstable decision boundary: it relies on correlations that are statistically reliable in the training distribution but geometrically fragile in the full input space. Resilience theory reframes adversarial robustness as a system's capacity to remain subcritical, that is, to prevent small perturbations from propagating into large output errors. The field's central open question is whether adversarial robustness can be achieved without catastrophic tradeoffs in standard accuracy, or whether the two objectives are structurally in tension.
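
The distinction between standard accuracy and robustness can be made concrete with a small sketch. For a linear classifier, a point with label y in {-1, +1} is robust to every L-infinity perturbation of size eps exactly when its margin y(w.x + b) exceeds eps times the L1 norm of w. The function name `robust_accuracy` and the four-point dataset below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def robust_accuracy(X, y, w, b, eps):
    """Fraction of points that stay correctly classified under every
    L-infinity perturbation of size at most eps (linear classifier only)."""
    margins = y * (X @ w + b)
    # A worst-case perturbation reduces the margin by eps * ||w||_1,
    # so a point survives only if its margin is larger than that.
    return float(np.mean(margins > eps * np.abs(w).sum()))

# Hypothetical toy data: labels in {-1, +1}, classifier sign(x[0]).
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [-0.2, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

for eps in (0.0, 0.3, 0.6, 3.0):
    print(f"eps={eps}: robust accuracy = {robust_accuracy(X, y, w, b, eps)}")
```

With eps set to 0 the function returns ordinary accuracy, and the value can only fall as eps grows. This measures how robust a fixed model is; it does not settle whether training for robustness must cost standard accuracy.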