Polysemanticity

Polysemanticity is the phenomenon in which individual neurons or activation directions in neural networks respond to multiple semantically unrelated inputs: a single neuron may fire for both cat faces and car wheels, or for both arithmetic operations and poetic meter. It is the primary obstacle to neuron-level mechanistic interpretability, because it violates the assumption that neurons function as atomic feature detectors.

Polysemanticity arises from feature superposition: when a network must represent more features than it has neurons, it encodes features as directions in high-dimensional activation space rather than as individual unit activations. Individual neurons then become linear combinations of multiple feature directions, making their responses appear semantically mixed when viewed in isolation.
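A minimal numerical sketch of this geometry, under a toy setup (the matrix W, the feature counts, and all names below are illustrative assumptions, not taken from any particular network): eight features share three neurons, each feature is assigned a direction in activation space, and each neuron's activation is a linear combination of the feature intensities.

```python
import numpy as np

# Toy setup (illustrative): 8 features must share 3 neurons, so each
# feature is stored as a direction in the 3-dimensional activation
# space rather than as its own dedicated unit.
rng = np.random.default_rng(0)
n_features, n_neurons = 8, 3

# Row i is feature i's direction in activation space (unit-normalized).
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def activations(x):
    """Map a vector of feature intensities to neuron activations."""
    return x @ W

# Present feature 2 alone, then feature 5 alone.
x_a = np.zeros(n_features); x_a[2] = 1.0
x_b = np.zeros(n_features); x_b[5] = 1.0

# Neuron 0's activation is a linear combination of every feature's
# component along that unit: a_0 = sum_i x_i * W[i, 0]. It therefore
# responds to both unrelated features.
print(activations(x_a)[0], activations(x_b)[0])
```

Viewed in isolation, neuron 0 appears to "detect" both features; viewed geometrically, it is simply one coordinate of a shared distributed code.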

The phenomenon challenges a foundational assumption of classical neuroscience and early connectionism: that neural representations are locally coded, with each unit corresponding to a specific concept or feature. Polysemanticity demonstrates that distributed coding is not merely a theoretical possibility but the default regime in trained deep networks. The implication is that understanding neural computation requires geometric analysis of activation space — not cataloging of individual neuron preferences.
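To make the geometric point concrete, here is a hedged sketch (the least-squares probe and all names below are illustrative choices, not a method claimed by this article) in which no single neuron tracks a feature well, but a learned direction in activation space recovers it. The setup continues the toy model above.

```python
import numpy as np

# Illustrative sketch: fit a linear probe (plain least squares) to read
# one feature out of activation space, and compare it with the best
# single-neuron readout.
rng = np.random.default_rng(1)
n_features, n_neurons, n_samples = 8, 3, 2000

W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse random feature intensities; feature 0 is the concept probed for.
X = rng.random((n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.3)
acts = X @ W                 # neuron activations: every unit mixes features
target = X[:, 0]             # ground-truth intensity of feature 0

# Center so the fit is equivalent to a probe with an intercept.
acts_c = acts - acts.mean(axis=0)
target_c = target - target.mean()

# Direction in activation space whose projection best predicts feature 0.
direction, *_ = np.linalg.lstsq(acts_c, target_c, rcond=None)
probe = acts_c @ direction

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# The direction-based readout beats any single-neuron readout, though it
# remains imperfect: with more features than neurons, interference
# between feature directions is unavoidable.
print("probe:", corr(probe, target_c))
print("best neuron:", max(corr(acts_c[:, j], target_c) for j in range(n_neurons)))
```

The design choice of probing a direction rather than a unit is exactly the shift described above: the feature is recoverable from the geometry of activation space even though no neuron is dedicated to it.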

It remains unresolved whether polysemanticity is a genuine architectural necessity or an artifact of training procedures optimized for prediction rather than interpretability. If it is necessary, then monosemantic representations (one concept per neuron) may be achievable only through deliberate architectural constraints, not as a natural consequence of learning.