Sparse Autoencoder

Sparse autoencoders are neural network architectures trained to reconstruct their inputs while enforcing sparsity on a hidden layer, typically by penalizing hidden-unit activations so that only a small fraction fire for any given input. The architecture consists of an encoder that maps the input to a sparse hidden representation and a decoder that reconstructs the input from that representation. The sparsity constraint forces the network to learn a dictionary of reusable features rather than memorizing individual inputs.
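A minimal sketch of this encoder/decoder structure with an L1 sparsity penalty, written in PyTorch. The class name, dimensions, and penalty coefficient are illustrative assumptions rather than any particular published implementation.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Reconstructs inputs through an overcomplete, sparsely active hidden layer."""

    def __init__(self, d_input: int, d_hidden: int):
        super().__init__()
        # Encoder maps the input to the hidden (dictionary) layer.
        self.encoder = nn.Linear(d_input, d_hidden)
        # Decoder reconstructs the input from the hidden code; its columns
        # are the learned feature directions.
        self.decoder = nn.Linear(d_hidden, d_input)

    def forward(self, x: torch.Tensor):
        # ReLU keeps hidden activations non-negative; combined with the L1
        # penalty below, only a small fraction of units fire per input.
        h = torch.relu(self.encoder(x))
        x_hat = self.decoder(h)
        return x_hat, h


def sae_loss(x, x_hat, h, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the hidden code.
    reconstruction = ((x - x_hat) ** 2).mean()
    sparsity = h.abs().mean()
    return reconstruction + l1_coeff * sparsity
</syntaxhighlight>

The hidden layer is usually much wider than the input so that the learned dictionary can be overcomplete; the L1 coefficient trades reconstruction fidelity against sparsity.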

In mechanistic interpretability, sparse autoencoders have become the primary tool for addressing polysemanticity, the problem that individual neurons in large models respond to multiple unrelated concepts. By decomposing activation vectors into sparse combinations of learned basis directions, researchers have recovered features corresponding to specific concepts such as 'the Eiffel Tower,' 'base rate neglect,' and 'intent to deceive.' These directions are often more interpretable than the raw neuron activations from which they are derived.
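To make the decomposition concrete, the sketch below applies the SparseAutoencoder defined above to a single activation vector and expresses it as a sparse sum of decoder columns. The dimensions and the random placeholder activation are assumptions for illustration; in practice the input would be a residual-stream or MLP activation captured from a language model, and the autoencoder would already be trained on many such activations.

<syntaxhighlight lang="python">
import torch

d_model, d_dict = 512, 4096                # hypothetical sizes
sae = SparseAutoencoder(d_model, d_dict)   # assume weights trained on model activations

# Placeholder for an activation vector captured from a language model.
activation = torch.randn(d_model)

with torch.no_grad():
    _, feature_acts = sae(activation.unsqueeze(0))

# Each active hidden unit contributes one decoder column (a learned
# feature direction) scaled by its activation; their sparse sum plus the
# decoder bias approximates the original activation vector.
active = torch.nonzero(feature_acts[0]).squeeze(-1)
directions = sae.decoder.weight[:, active]           # (d_model, n_active)
approx = directions @ feature_acts[0, active] + sae.decoder.bias
</syntaxhighlight>

Interpreting a feature then amounts to inspecting which inputs most strongly activate a given hidden unit and what its decoder direction represents.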

The foundational question is whether sparse autoencoders reveal features the network actually uses, or merely produce a human-interpretable decomposition of a representation that is genuinely distributed and non-decomposable. If the latter, sparse autoencoders are a tool for explainability theater, offering comforting stories rather than true understanding. The answer likely depends on whether neural networks really do encode features as sparse directions in activation space, or whether the apparent structure is an artifact of the decomposition method.

Sparse autoencoders are therefore not merely an engineering tool. They are an empirical test of whether neural network representations are structured in a way that permits human comprehension, or whether the complexity of trained networks exceeds any decomposition into named components.