Adversarial Examples
Adversarial examples are inputs to machine learning models that have been intentionally crafted — usually by making small, often imperceptible perturbations — to cause the model to produce incorrect outputs with high confidence. A photograph of a panda, modified by adding structured pixel noise invisible to humans, causes a state-of-the-art image classifier to confidently identify it as a gibbon. The perturbation exploits the model's learned decision boundary, not the image's semantic content.
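The construction behind such perturbations can be sketched with a one-step fast-gradient-sign attack (FGSM, the method behind the panda example) on a toy logistic model. Everything here is a synthetic stand-in, not a real image classifier: the weight vector `w` plays the role of the trained network, and the "image" `x` is just a vector chosen so the clean prediction is confidently correct.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step fast-gradient-sign perturbation of input x.

    For p = sigmoid(w.x + b) with cross-entropy loss, the gradient of the
    loss w.r.t. the input is dL/dx = (p - y) * w, so the loss-maximizing
    step inside an L-infinity ball of radius eps is eps * sign((p - y) * w).
    """
    p = sigmoid(w @ x + b)
    return x + eps * np.sign((p - y) * w)

rng = np.random.default_rng(0)
d = 10_000                        # high-dimensional input, e.g. pixels
w = rng.normal(size=d)            # stand-in for learned weights
b = 0.0
x = 2.0 * w / (w @ w)             # clean input: logit w.x + b = 2.0, p ~ 0.88

x_adv = fgsm(x, y=1.0, w=w, b=b, eps=0.01)

p_clean = sigmoid(w @ x + b)      # ~0.88: confident, correct
p_adv = sigmoid(w @ x_adv + b)    # near 0.0: confident, wrong
max_change = np.max(np.abs(x_adv - x))  # exactly eps = 0.01 per coordinate
```

The point of the sketch is that the attack never looks at "what a panda is"; it only follows the gradient of the model's own loss, which is why a 0.01-per-coordinate change, imperceptible in pixel terms, can flip a confident prediction.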
The existence of adversarial examples is not a bug that better training eliminates. They appear to be a fundamental property of classifiers trained by gradient descent on high-dimensional inputs: because decision boundaries in high-dimensional spaces are complex and brittle, almost every correctly classified input has a nearby point on the wrong side of a boundary. Robustness to adversarial examples and accuracy on clean data also appear to be in tension; improving one often degrades the other, suggesting a structural trade-off rather than a correctable flaw.
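Why dimensionality matters can be seen in a back-of-the-envelope calculation for a linear model (an illustrative assumption, with random Gaussian weights standing in for a trained classifier). With a fixed per-coordinate budget eps, the worst-case logit shift an attacker can force is eps times the sum of the absolute weights, which grows linearly with dimension d, while a random perturbation of the same per-coordinate size barely moves the logit at all:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.005                  # fixed, imperceptibly small per-coordinate budget

adv_shifts, rnd_shifts = [], []
for d in (100, 1_000, 100_000):
    w = rng.normal(size=d)   # stand-in for learned weights

    # Worst case over the L-infinity ball |delta|_inf <= eps:
    # max of w.delta is attained at delta = eps * sign(w),
    # giving a logit shift of eps * sum(|w_i|) -- linear in d.
    adv_shifts.append(eps * np.sum(np.abs(w)))

    # Same per-coordinate size, but with random (unaligned) signs:
    # the shift concentrates near eps * sqrt(d), far smaller.
    rnd = eps * rng.choice([-1.0, 1.0], size=d)
    rnd_shifts.append(abs(w @ rnd))
```

At d = 100,000 the aligned perturbation moves the logit by hundreds while the random one moves it by a few units. This is the sense in which the phenomenon is structural: in high dimensions, tiny coordinated nudges accumulate into a large movement across the decision boundary, and random noise of the same magnitude does not.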
The deeper implication is that these models do not perceive the way humans perceive. They classify by statistical pattern rather than by the structural features that make a panda a panda. The adversarial example is a probe that reveals this gap — and what it reveals is that aligning a model's outputs with human intentions requires more than minimizing prediction error on a training set.