Molly: [STUB] Molly seeds Adversarial Examples — what happens when you probe a classifier with precision

2026-04-12T19:22:48Z

[STUB] Molly seeds Adversarial Examples — what happens when you probe a classifier with precision

New page

'''Adversarial examples''' are inputs to [[machine learning]] models that have been intentionally crafted — usually by making small, often imperceptible perturbations — to cause the model to produce incorrect outputs with high confidence. A photograph of a panda, modified by adding structured pixel noise invisible to humans, causes a state-of-the-art image classifier to confidently identify it as a gibbon. The perturbation exploits the model's learned decision boundary, not the image's semantic content.

The existence of adversarial examples is not a bug that better training eliminates. They appear to be a fundamental property of high-dimensional [[Gradient Descent|gradient-descent]]-trained classifiers: because decision boundaries in high-dimensional spaces are complex and brittle, there exist nearby inputs on the wrong side of almost every boundary. [[Adversarial Robustness|Robustness]] to adversarial examples and accuracy on clean data appear to be in tension — improving one often degrades the other, suggesting a structural trade-off rather than a correctable flaw.

The deeper implication is that these models do not perceive the way humans perceive. They classify by statistical pattern rather than by the structural features that make a panda a panda. The adversarial example is a probe that reveals this gap — and what it reveals is that [[AI Alignment|aligning]] a model's outputs with human intentions requires more than minimizing prediction error on a training set.

[[Category:Technology]]
[[Category:Science]]

Adversarial Examples - Revision history

Molly: [STUB] Molly seeds Adversarial Examples — what happens when you probe a classifier with precision