Feature Superposition
Feature superposition is the phenomenon in neural networks where more features are represented in a layer than there are neurons, achieved by encoding features as directions in activation space rather than as individual neuron activations. Because high-dimensional spaces contain exponentially many near-orthogonal vectors, a network with N neurons can represent far more than N features simultaneously — at the cost of interference between co-active features.
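The near-orthogonality claim can be checked numerically. The sketch below (dimensions are illustrative, not taken from any particular model) samples far more random unit directions than there are neurons and measures their pairwise cosine similarities, which stay small even though the directions cannot all be exactly orthogonal:

```python
import numpy as np

# Illustrative sketch: random unit vectors in a high-dimensional space
# are nearly orthogonal, so many more "feature directions" than neurons
# can coexist with only small pairwise interference.
rng = np.random.default_rng(0)

n_neurons = 512    # dimensionality of the layer (hypothetical)
n_features = 2048  # far more feature directions than neurons

# Sample feature directions and normalize each to unit length.
W = rng.standard_normal((n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct feature directions.
cos = W @ W.T
off_diag = cos[~np.eye(n_features, dtype=bool)]

print(f"max |cos| between distinct features: {np.abs(off_diag).max():.3f}")
print(f"mean |cos|: {np.abs(off_diag).mean():.3f}")
```

Typical off-diagonal similarities scale like 1/√n_neurons, which is the "interference" each superimposed feature pays per co-active neighbor.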
The phenomenon is explained by the Superposition Hypothesis (Elhage et al., 2022), which proposes that networks trade off feature fidelity against feature count depending on the sparsity of feature co-occurrence: rarely co-active features can be superimposed because they rarely interfere. The practical consequence is polysemantic neurons — neurons that activate for multiple unrelated concepts because they participate in multiple superimposed feature directions.
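The sparsity trade-off can be illustrated directly. In this hedged sketch (the dimensions and the linear dot-product readout are simplifying assumptions, not the toy model's exact architecture), features stored as near-orthogonal directions are read back accurately when only a few are active at once, and badly when many are:

```python
import numpy as np

# Sketch of the sparsity trade-off: superimposed features interfere
# only when co-active, so sparse feature vectors are recovered far
# more faithfully than dense ones. Shapes are illustrative.
rng = np.random.default_rng(1)

n_neurons, n_features = 256, 1024
W = rng.standard_normal((n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def recover_error(k):
    """Activate k random features, superimpose them into neuron space,
    read every feature back by dot product, and return mean |error|."""
    x = np.zeros(n_features)
    active = rng.choice(n_features, size=k, replace=False)
    x[active] = 1.0
    h = W.T @ x        # superimposed representation, shape (n_neurons,)
    x_hat = W @ h      # linear readout of every feature direction
    return np.abs(x_hat - x).mean()

print("sparse (k=4)   mean error:", round(recover_error(4), 3))
print("dense (k=256)  mean error:", round(recover_error(256), 3))
```

The readout error grows with the number of co-active features, which is why rarely co-occurring features are the cheapest to superimpose.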
Feature superposition is a fundamental obstacle to mechanistic interpretability at the neuron level. It implies that the right description level for neural network features is not individual neurons but directions in activation space — a geometric fact that motivates the use of sparse autoencoders to recover interpretable monosemantic directions from polysemantic activations. Whether sparse autoencoders faithfully recover the features the network actually uses, rather than imposing an arbitrary post-hoc decomposition, is a foundational open question that determines whether feature-level interpretability is coherent.
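A minimal sketch of the sparse autoencoder recipe, assuming the common setup of an overcomplete ReLU encoder, a linear decoder, and an L1 penalty on feature activations; all shapes, initial weights, and the penalty coefficient l1_coef here are illustrative choices, not a specific published configuration:

```python
import numpy as np

# Hedged sketch of a sparse autoencoder (SAE) forward pass and loss.
# The SAE maps model activations into a much larger feature basis,
# where the L1 term pushes most feature activations to zero.
rng = np.random.default_rng(2)

d_model, d_sae = 64, 512  # d_sae >> d_model: an overcomplete dictionary
W_enc = rng.standard_normal((d_model, d_sae)) * 0.05
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_model)) * 0.05
b_dec = np.zeros(d_model)

def sae_loss(x, l1_coef=1e-3):
    """Reconstruction MSE plus L1 sparsity penalty on feature activations."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # sparse feature activations
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coef * np.abs(f).sum(axis=-1).mean()
    return recon, sparsity

x = rng.standard_normal((8, d_model))       # a batch of model activations
recon, sparsity = sae_loss(x)
print(f"reconstruction MSE: {recon:.4f}, L1 penalty: {sparsity:.4f}")
```

After training, each row of the decoder matrix is interpreted as a candidate monosemantic feature direction in the model's activation space.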