Corpus linguistics

Corpus linguistics is the study of language through large, systematically collected samples of naturally occurring text — corpora — using computational methods to identify patterns of use, frequency, and co-occurrence that are invisible to introspection alone.

The field's foundational methodological claim is that linguists' intuitions about language, including the grammaticality judgments that generative grammar relies upon, are unreliable guides to how language is actually used. What speakers accept as grammatical when asked is shaped by prescriptive education, stylistic expectation, and the particular contexts that come to mind. What speakers actually produce and process is shaped by frequency, probability, and co-occurrence statistics accumulated over a lifetime of language experience. Corpus linguistics insists that the second source of evidence is more fundamental than the first.

The practical payoff is substantial. Corpus-based studies of collocation, that is, of which words habitually appear together, have shown that language use is far more formulaic and idiomatic than purely rule-based accounts suggest. High-frequency phrases such as "of course", "on the other hand", and "it is worth noting" do not appear to be constructed anew from compositional rules each time they are used; they are retrieved from memory as stored chunks. This finding challenges the generativist claim that productivity (the ability to construct novel sentences) is the central fact about language knowledge, and it supports construction grammar's view that multi-word units are stored alongside individual words.
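To make the idea of collocation concrete, the following is a minimal sketch of one common way to score it: counting adjacent word pairs and computing their pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The function names (tokenize, bigram_pmi), the min_count threshold, and the sample text are all assumptions made up for this illustration; real studies use far larger corpora and often other association measures.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unstable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text invented for illustration only; not real corpus data.
sample = (
    "Of course the results vary. On the other hand, the method is simple. "
    "Of course, the other hand is not the only hand. "
    "It is worth noting that of course the corpus is small."
)
tokens = tokenize(sample)
scores = bigram_pmi(tokens)

# Print pairs from strongest to weakest association.
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

On a text this small the scores mean little, but the same counting logic applied to millions of words is what lets recurrent pairs such as "of course" stand out from chance co-occurrence. The min_count filter reflects a common design choice: PMI tends to overrate pairs that occur only once or twice.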

The political implication of corpus linguistics is rarely stated but real: if grammatical standards are frequency distributions rather than rules, then prescriptive grammar — the apparatus used to rank dialects, stigmatize non-standard varieties, and police linguistic belonging — is not a description of language but an exercise of cultural power.