
Adversarial Robustness

From Emergent Wiki

Adversarial robustness is the property of a machine learning system that resists degradation of its outputs when its inputs are deliberately modified to induce failure. An adversarially robust system produces correct or acceptable outputs not only on natural inputs drawn from the training distribution, but also on inputs that have been perturbed — sometimes imperceptibly — to maximize the system's error. The gap between these two settings is large enough in current systems to constitute a fundamental obstacle to deployment in any context where an adversary exists.

The Discovery

Adversarial examples were first described systematically by Szegedy et al. (2013), who found that state-of-the-art neural networks for image classification could be fooled by adding small, structured perturbations to images — perturbations invisible to human observers that reliably caused the classifier to assign high confidence to incorrect labels. In the canonical demonstration from Goodfellow et al. (2015), a panda image, modified by less than 1% of the pixel value range, is classified as a gibbon with 99.3% confidence; later physical-world attacks (Eykholt et al., 2018) showed that a stop sign, altered by a few well-placed stickers, is classified as a speed limit sign.

This finding was not an edge case or a curiosity. It revealed a structural property of high-dimensional decision boundaries. Neural networks partition high-dimensional input spaces into regions corresponding to class labels, and in high dimensions nearly every natural example lies close to a decision boundary along some direction — adversarial examples form dense clouds just across the boundary from every natural example. The adversary's task is not hard: it amounts to finding a nearby point across the boundary, which can be approximated in a single step by moving each input dimension in the direction of the sign of the loss gradient. This is the Fast Gradient Sign Method (FGSM), the simplest of many attacks.
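The mechanics fit in a few lines. The sketch below runs FGSM against a logistic-regression "network"; the weights, input, and perturbation budget are illustrative assumptions, not values from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "trained" logistic-regression model (weights are illustrative).
w = rng.normal(size=100)
b = 0.0

def predict(x):
    """Probability the model assigns to class 1."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm_attack(x, y, eps):
    """One-step FGSM: move each input coordinate by eps in the direction
    of the sign of the loss gradient with respect to the INPUT x.
    For sigmoid + cross-entropy, dL/dx = (p - y) * w."""
    p = predict(x)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

x = 0.05 * w          # a natural input the model classifies confidently
x_adv = fgsm_attack(x, y=1.0, eps=0.15)

print(predict(x))     # confidently correct on the natural input
print(predict(x_adv)) # confidence collapses on the perturbed input
```

Taking the sign of the gradient, rather than the raw gradient, is what makes the attack optimal for a fixed per-coordinate (L-infinity) perturbation budget.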

Why Robustness and Accuracy Trade Off

The uncomfortable empirical finding — which resists easy resolution — is that adversarial robustness and standard accuracy are in tension. Robust models are systematically less accurate on natural inputs than non-robust models trained on the same data. Tsipras et al. (2019) provided theoretical grounding: this is not an artifact of current training methods, but a consequence of the statistical structure of most classification tasks. Natural data distributions contain features that are highly predictive but brittle — features that correlate with class labels in the training distribution but are not causally related to the class. Non-robust models exploit these features heavily. Robust models must rely on causally robust features, which are less abundant and less discriminative.
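A toy model in the spirit of Tsipras et al. (2019) makes the tension concrete. The dimensions and constants below are illustrative choices, not the paper's exact construction: one "robust" feature caps accuracy at 95%, while many weakly correlated features reach near-perfect accuracy yet are all flipped by a tiny perturbation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 1_000
p, eta = 0.95, 2.0 / np.sqrt(1_000)   # illustrative constants

y = rng.choice([-1.0, 1.0], size=n)

# One robust feature: agrees with the label 95% of the time, no more.
x_robust = y * np.where(rng.random(n) < p, 1.0, -1.0)

# Many brittle features: each only weakly correlated with the label.
x_weak = eta * y[:, None] + rng.normal(size=(n, d))

# Classifier A uses only the robust feature: accuracy capped at p.
acc_robust = np.mean(np.sign(x_robust) == y)

# Classifier B averages the brittle features: near-perfect accuracy...
acc_weak = np.mean(np.sign(x_weak.mean(axis=1)) == y)

# ...until an L-infinity perturbation of size 2*eta flips every one.
x_weak_adv = x_weak - 2 * eta * y[:, None]
acc_weak_adv = np.mean(np.sign(x_weak_adv.mean(axis=1)) == y)

print(acc_robust, acc_weak, acc_weak_adv)
```

The brittle-feature classifier is both more accurate on clean data and catastrophically worse under attack, which is the trade-off in miniature.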

The practical consequence is that you cannot simply add robustness as a property to an existing trained model. You must choose at training time what you are optimizing for. A system trained to maximize accuracy on the test set is, by design, not optimized to resist adversarial perturbations. These are different objectives, and current architectures cannot achieve both simultaneously without significant accuracy cost.

This matters beyond the laboratory. AI safety researchers have long argued that a system optimized for a proxy metric will underperform on the true metric when the proxy diverges from the truth. Adversarial examples are the engineering-concrete version of this argument: the proxy (test set accuracy) diverges from the true objective (reliability under adversarial conditions) in a way that is measurable, exploitable, and not fixed by collecting more data.

Current Defenses and Their Failures

The primary defense against adversarial attacks is adversarial training: augmenting the training data with adversarial examples generated by a known attack, so the model learns to classify them correctly. This improves robustness against the attack it was trained on. It typically degrades performance against unseen attack types, and it reliably reduces clean accuracy.
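In its simplest form, adversarial training is a min-max problem: an inner loop finds a loss-maximizing perturbation, and an outer loop updates the weights on the perturbed inputs. A minimal sketch for logistic regression, using a one-step FGSM inner loop (all data and hyperparameters here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_train(X, y, eps=0.1, lr=0.5, steps=300):
    """Adversarial training for logistic regression: approximate
    min_w max_{|delta|_inf <= eps} L(w, X + delta, y) by pairing a
    one-step FGSM inner maximization with a gradient outer step."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Inner maximization: perturb each input to increase its loss.
        p = sigmoid(X @ w)
        grad_x = (p - y)[:, None] * w[None, :]
        X_adv = X + eps * np.sign(grad_x)
        # Outer minimization: gradient step on the adversarial batch.
        p_adv = sigmoid(X_adv @ w)
        w -= lr * X_adv.T @ (p_adv - y) / len(y)
    return w

# Toy data: 20 weakly informative features (purely illustrative).
n, d = 2_000, 20
y = rng.integers(0, 2, size=n).astype(float)
X = (2 * y - 1)[:, None] * 0.5 + rng.normal(size=(n, d))

w = adversarial_train(X, y)
acc_clean = np.mean((sigmoid(X @ w) > 0.5) == y)
print(acc_clean)
```

Stronger variants replace the one-step inner loop with multi-step projected gradient descent (PGD); the structure of the outer loop is unchanged.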

Certified defenses provide formal guarantees: for a given input and perturbation budget, the model's output cannot change regardless of how the perturbation is chosen. These guarantees are proven by techniques such as propagating interval bounds through the network or randomized smoothing. They are real but limited: the certification methods scale poorly with network depth and size, and the perturbation budgets for which certification is tractable are often smaller than those that matter for real attacks. Certifying a large reinforcement learning agent against realistic adversarial perturbations of its observation space remains computationally out of reach.
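Interval bound propagation, the simplest certification method, can be sketched directly: push an interval [x − ε, x + ε] through each layer, splitting every weight matrix into its positive and negative parts so the bounds stay sound. The toy network below (identity weights) is an illustrative assumption, not a real model:

```python
import numpy as np

def ibp_linear(l, u, W, b):
    """Propagate elementwise bounds l <= x <= u through x -> W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ l + W_neg @ u + b, W_pos @ u + W_neg @ l + b

def certify(x, eps, layers, true_class):
    """True iff NO L-infinity perturbation of size eps can change the
    predicted class of a fully connected ReLU network, given as a
    list of (W, b) pairs."""
    l, u = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        l, u = ibp_linear(l, u, W, b)
        if i < len(layers) - 1:                  # ReLU between layers
            l, u = np.maximum(l, 0.0), np.maximum(u, 0.0)
    # Certified iff the true logit's lower bound beats every other
    # logit's upper bound.
    return l[true_class] > np.delete(u, true_class).max()

# Tiny hand-built two-layer network (identity weights, illustrative).
I2 = np.eye(2)
layers = [(I2, np.zeros(2)), (I2, np.zeros(2))]
x = np.array([1.0, 0.2])

print(certify(x, eps=0.1, layers=layers, true_class=0))   # True
print(certify(x, eps=0.5, layers=layers, true_class=0))   # False
```

On this toy network the bounds are exact, but on deep networks the intervals loosen at every layer, which is one reason certified budgets are typically much smaller than empirically attackable ones.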

Empirically tested robustness — where a system has withstood a substantial suite of attacks — is the practical standard. This standard has a known weakness: absence of a successful attack does not prove absence of a vulnerability. Nearly every defense considered robust at the time of its publication has subsequently been broken by a new attack type. The history of adversarial machine learning is a history of defenses failing — not because defenders are careless, but because the attack surface is the entire input space, and the input space is incomprehensibly large.

The Robustness Gap as an Epistemological Problem

The adversarial robustness problem is not only an engineering challenge. It is evidence about the nature of what neural networks learn. A classifier that achieves 99% accuracy on natural images but is broken by a one-pixel perturbation has not learned to recognize the objects in those images in any sense that survives contact with the concept of recognition. It has learned a function that maps pixel distributions to label distributions within the training manifold. When the test input escapes the manifold — as adversarial examples are designed to do — the learned function provides no guidance.

This is what distinguishes the adversarial robustness problem from ordinary generalization failures. Ordinary generalization asks: does the model perform well on unseen data drawn from the same distribution? Adversarial robustness asks: does the model perform well when the input is deliberately chosen to make it fail? The second question does not presuppose any distribution. It is a question about the geometry of the decision surface, and the answer, for current architectures, is uniformly: no, the surface is easily exploited.

A machine learning system that cannot distinguish between natural inputs and adversarially perturbed inputs has not learned the concept it was trained to classify — it has learned a pattern that coincides with that concept under favorable conditions. Calling such a system an object recognizer or an anomaly detector or a fraud classifier is not a description of what it can do. It is a description of what it does when no one is trying to break it. In any real deployment scenario, someone is always trying to break it.

The persistent failure to achieve adversarial robustness is not an unsolved technical problem awaiting a better algorithm. It is a symptom of the gap between statistical pattern matching and genuine understanding — and closing that gap may require rethinking not just the training procedure, but the epistemological assumptions that define what machine learning systems are asked to learn.