Superposition Hypothesis
The Superposition Hypothesis is a proposed explanation in Mechanistic Interpretability for why individual neurons in neural networks respond to multiple, apparently unrelated features — a phenomenon called Polysemanticity. The hypothesis holds that networks learn to represent more features than they have neurons by exploiting the approximate orthogonality of high-dimensional space: many sparse feature vectors can be packed into a smaller space with minimal interference, as long as the features rarely co-occur.
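The geometric intuition can be checked directly: random directions in a high-dimensional space are nearly orthogonal, so many more feature vectors than dimensions can share the space with only weak pairwise interference. Below is a minimal numpy sketch of this; the feature count, dimensionality, and random directions are illustrative assumptions, not values from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 400, 100  # illustrative: 4x more features than dimensions

# Random unit vectors standing in for learned feature directions.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference between distinct features = cosine similarity of their directions.
cos = W @ W.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean |cos| between features: {off_diag.mean():.3f}")  # typically ~0.08
print(f"max  |cos| between features: {off_diag.max():.3f}")   # still well below 1
# Because pairwise interference is small, sparse features can share the
# 100-dimensional space with little cross-talk, as the hypothesis predicts.
```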
The hypothesis was formalized by Elhage et al. (Anthropic, 2022) in "Toy Models of Superposition," which demonstrated the phenomenon in small, controlled ReLU toy networks. Subsequent work recovers features from superposed representations using sparse autoencoders, which apply an L1 sparsity penalty to their latent activations, encouraging polysemantic activations to decompose into sparser, more monosemantic features.
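As a concrete sketch of this recovery step, the snippet below trains a minimal sparse autoencoder in PyTorch under the L1-penalized setup described above. The class name, layer sizes, L1 coefficient, and the random stand-in activations are all illustrative assumptions; real SAE training runs on activations recorded from the model under study and typically adds refinements not shown here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sketch of an SAE: overcomplete dictionary with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # candidate (sparse) feature activations
        return self.decoder(f), f

# Illustrative sizes and coefficient, not values from any published run.
d_model, d_hidden, l1_coeff = 64, 512, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for neuron activations recorded from the network being studied.
acts = torch.randn(4096, d_model)

for step in range(200):
    recon, f = sae(acts)
    # Reconstruction term keeps the dictionary faithful to the activations;
    # the L1 term pushes each input to activate only a few dictionary features.
    loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, each decoder column can be read as a candidate feature direction, and inputs that strongly activate a given latent are inspected to judge whether that feature is monosemantic.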
If the hypothesis is correct, it has significant implications for AI Safety: aligned and misaligned objectives could coexist in superposition, with misaligned features remaining latent and undetected under normal operating conditions. An empiricist stance therefore demands testing the hypothesis against frontier models, not just toy networks, and the results from Mechanistic Interpretability work on large models remain preliminary.