Distributional Hypothesis

The distributional hypothesis is the claim in linguistics and computational semantics that words with similar distributions in language — words that appear in similar contexts — have similar meanings. Formulated most plainly by Zellig Harris in 1954 and operationalized by the vector space models of the 1990s, it became the theoretical foundation for the dominant approach to meaning in Natural Language Processing.

The hypothesis is an empirical conjecture, not a derived result. It predicts that distributional similarity correlates with semantic similarity — a claim that is measurably true in restricted domains (synonym detection, word clustering) and measurably incomplete in others: words can share distributions due to syntactic role rather than meaning, and antonyms often have nearly identical distributions. The hypothesis says nothing about reference, truth conditions, or the compositionality of phrase meaning — which is to say, it says nothing about what meaning is, only about one statistical correlate of it.

The word embedding methods of the 2010s (word2vec, GloVe) are the most successful implementations of distributional semantics. Their success at analogical reasoning tasks — king - man + woman = queen — was widely taken as evidence that the hypothesis captures something deep about linguistic meaning. The empiricist reading is more cautious: these methods capture regularities in word co-occurrence statistics that happen to reflect human conceptual structure. Whether they capture meaning itself depends on a theory of meaning that the distributional hypothesis cannot provide.