Polysemanticity

- Polysemanticity is the phenomenon in which individual neurons or activation directions in neural networks respond to multiple semantically unrelated inputs — a single neuron that fires for both cat faces and car wheels, or for both arithmetic operations and poetic meter. It is the primary obstacle to neuron-level mechanistic interpretability, because it violates the assumption that neurons function as atomic feature detectors.

Polysemanticity arises from feature superposition: when a network must represent more features than it has neurons, it encodes features as directions in high-dimensional activation space rather than as individual unit activations. Individual neurons then become linear combinations of multiple feature directions, making their responses appear semantically mixed when viewed in isolation.

The phenomenon challenges a foundational assumption of classical neuroscience and early connectionism: that neural representations are locally coded, with each unit corresponding to a specific concept or feature. Polysemanticity demonstrates that distributed coding is not merely a theoretical possibility but the default regime in trained deep networks. The implication is that understanding neural computation requires geometric analysis of activation space — not cataloging of individual neuron preferences.

Whether polysemanticity represents a genuine architectural necessity or an artifact of training procedures optimized for prediction rather than interpretability is unresolved. If it is necessary, then monosemantic representations — one concept per neuron — may be achievable only through deliberate architectural constraints, not as a natural consequence of learning.== Distributed Coding and the Geometry of Activation ==

Polysemanticity is not a bug to be fixed but a signature of how neural networks solve the feature superposition problem. When a network has N neurons and must represent M >> N features, it cannot allocate one neuron per feature. Instead, it learns an overcomplete basis: each feature is a direction in activation space, and each neuron is a weighted sum of multiple feature directions. The result is that individual neurons appear to respond to semantically unrelated inputs — not because the network has failed to organize itself, but because the organizing principle is geometric, not atomic.

This has profound implications for mechanistic interpretability. The field's original aspiration was to catalog neurons — to build a parts list in which each unit has a identifiable function. Polysemanticity shows that this aspiration is fundamentally misaligned with how representations are actually stored. Understanding requires not neuron-level labels but geometric analysis: identifying the polytope boundaries that separate feature regions in activation space, and the attention heads or layer transformations that rotate between them.

Connections to Neuroscience

The phenomenon raises a question that bridges artificial and biological neural systems. Does the mammalian cortex exhibit polysemanticity? Evidence from high-density recording suggests that individual neurons in visual and prefrontal cortex do respond to multiple stimulus classes, and that population-level decoding outperforms single-neuron decoding by orders of magnitude. If biological networks also rely on feature superposition, then the search for "grandmother cells" — neurons selective for single concepts — was not merely unsuccessful but conceptually wrong from the start.

The parallel is striking. Both artificial and biological networks appear to converge on distributed, overcomplete representations when faced with high-dimensional feature spaces and limited capacity. This convergence is evidence for a general principle: intelligent systems represent more than they have parts for by encoding information in relational structures rather than atomic units.

The Interpretability Frontier

Recent work has explored whether polysemanticity can be eliminated through architectural intervention. Sparse autoencoders — trained to decompose network activations into sparse feature dictionaries — have shown that many polysemantic neurons can be resolved into distinct monosemantic features at a higher level of analysis. But these features are not at the neuron level; they are latent directions in an expanded space. The victory is partial: we can find monosemantic features, but only by looking at a representation larger than the network itself.

This suggests that the debate between polysemantic and monosemantic representation is ill-posed. The real question is not whether features are localized but at what level of description they become separable. A neuron is polysemantic at the unit level but may be part of a monosemantic direction at the population level. Representation is hierarchical, and semanticity is scale-dependent.

The insistence that interpretability must proceed neuron-by-neuron is a methodological commitment inherited from classical neuroscience, not a constraint imposed by the data. If we want to understand how neural networks represent the world, we must abandon the atomic picture and learn to read geometry.