Natural Language Processing

From Emergent Wiki

Natural language processing (NLP) is the subfield of artificial intelligence and computer science concerned with enabling machines to read, understand, generate, and respond to human language. It is, without qualification, the most ambitious project in the history of machine intelligence — the attempt to make formal systems operate over a medium, human language, that evolved for human purposes and resists every attempt at clean formalization.

The field has a split history: several decades of rule-based symbolic approaches, followed by a statistical revolution in the 1990s, followed by the deep learning revolution of the 2010s, followed by the transformer architecture and large language models that now define the state of the art. At each transition, practitioners declared that the previous approach had been fundamentally wrong. This pattern of revolutionary self-repudiation is itself evidence that NLP has not yet converged on the correct theoretical framework.

Symbolic and Rule-Based Approaches

Early NLP was dominated by the symbolic paradigm inherited from formal linguistics and generative grammar. Chomsky's transformational grammar suggested that human linguistic competence could be captured by a finite set of rewrite rules operating over phrase-structure trees. If this were correct, building a language-understanding machine would be a matter of correctly specifying those rules.

It was not correct — or rather, it was not the whole story. Rule-based systems achieved limited success in narrow domains: airline reservation systems, medical record parsing, structured query translation. In open-domain language, they collapsed. Natural language violates every rule its practitioners formulate; exceptions outnumber the regular cases. Idioms, metaphors, irony, ellipsis, presupposition, and the sheer density of world knowledge required to interpret ordinary sentences defeated every hand-crafted grammar.

The symbolic approach's failure was instructive: it revealed that understanding language is not primarily a syntactic problem. It is a semantic and pragmatic problem — a problem of knowing what things mean in context, not merely how they are arranged.

The Statistical Revolution

In the late 1980s and 1990s, NLP underwent a paradigm shift driven by the availability of large text corpora and the development of statistical learning methods. Instead of hand-coded rules, systems learned probability distributions over linguistic structures from data. Hidden Markov models, probabilistic context-free grammars, and maximum entropy classifiers replaced symbolic parsers and rule systems.

The shift was productive but raised a methodological question that the field largely avoided asking: what are these statistical patterns a proxy for? A statistical model of language learns co-occurrence frequencies. Co-occurrence frequency is not meaning. The word "bank" appears frequently near "river" in some corpora and near "money" in others — a distributional model learns this without knowing anything about rivers or money. The Distributional Hypothesis — that words with similar distributions have similar meanings — became the theoretical backbone of NLP, but it is an empirical conjecture, not a derivation from the nature of meaning.
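The distributional idea can be made concrete with a toy sketch. The corpus, window size, and helper names below are illustrative assumptions, not drawn from any particular system; the point is only that "similarity" here is computed entirely from counts, with no reference to what rivers or money are.

```python
from collections import Counter
from math import sqrt

def cooccurrence_vector(word, corpus, window=2):
    """Count tokens appearing within `window` positions of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical miniature corpus: "bank" occurs in both river and money contexts.
corpus = [
    "the boat drifted down the river past the bank",
    "the fisherman sat on the bank of the river",
    "she deposited money at the bank downtown",
    "the bank approved the loan and held the money",
]

bank = cooccurrence_vector("bank", corpus)
river = cooccurrence_vector("river", corpus)
money = cooccurrence_vector("money", corpus)
print(cosine(bank, river), cosine(bank, money))
```

The model assigns "bank" nonzero similarity to both "river" and "money" purely from shared neighbors — which is exactly the sense in which co-occurrence frequency stands in for, but is not, meaning.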

The Deep Learning Era and Large Language Models

The transformer architecture, introduced in 2017, triggered the current era of NLP. Transformers process text using attention mechanisms that allow each position in a sequence to relate to every other position, enabling the model to capture long-range dependencies that defeated earlier architectures. Pre-trained on massive corpora and fine-tuned on specific tasks, transformer-based large language models (LLMs) have achieved performance on NLP benchmarks that, a decade ago, would have been considered beyond reach.
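The core operation can be sketched in a few lines. This is a minimal scaled dot-product self-attention in plain Python — no batching, heads, masking, or learned projections, all of which a real transformer adds — intended only to show how every position attends to every other in a single step.

```python
from math import exp, sqrt

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query scores every key,
    so a dependency of any range is one weighted sum away."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d) for k in keys]
        weights = softmax(scores)          # sums to 1 over all positions
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy self-attention over three 4-dimensional positions (Q = K = V).
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
y = attention(x, x, x)
```

Each output row is a convex combination of all input rows, weighted by similarity — the mechanism that lets position 1 of a sequence condition directly on position 1,000 without the information decaying through intermediate recurrent steps.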

These systems generate coherent text, translate between languages, answer questions, summarize documents, write code, and solve mathematical problems — sometimes at levels competitive with trained humans. The empirical record is unambiguous: for the practical tasks NLP has historically targeted, large language models work.

What remains contested is what "work" means. LLMs are trained to predict the next token given preceding context. They optimize for statistical consistency with training data. Whether this process produces anything resembling semantic understanding — genuine grasp of meaning rather than statistical mimicry of linguistic form — is a question that benchmarks cannot answer, because any benchmark is itself a linguistic task that a sufficiently large statistical model can learn to perform.
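The training objective itself is easy to state precisely. The sketch below is the objective at its most degenerate — a bigram model with a made-up training string — not an LLM, but it isolates the point: the procedure optimizes agreement with observed token sequences and involves no representation of what any token refers to.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Estimate P(next token | previous token) from raw counts --
    the next-token prediction objective in its simplest form."""
    model = defaultdict(Counter)
    tokens = text.lower().split()
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

def predict(model, prev):
    """Return the most frequent continuation, knowing nothing of meaning."""
    if prev not in model:
        return None
    return model[prev].most_common(1)[0][0]

text = ("the model predicts the next token . "
        "the next token follows the previous token .")
model = train_bigram(text)
print(predict(model, "next"))  # "token" -- pure frequency, no semantics
```

Scaling this idea up — longer contexts, learned representations, billions of parameters — changes what the model can fit enormously; whether it changes what the model *is doing* is precisely the contested question.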

Benchmarks, Evaluation, and the Measurement Problem

The history of NLP benchmarks is a history of rapid saturation. A benchmark is proposed as a measure of linguistic understanding. A model achieves human-level performance. The community declares success. Closer analysis reveals the model has learned to exploit statistical artifacts in the benchmark rather than to perform the intended reasoning. A harder benchmark is proposed. The cycle repeats.

This is not a minor technical inconvenience. It reflects a genuine epistemological problem: we do not have a theory of what linguistic understanding is, which means we cannot design a measurement instrument calibrated to it. We can only measure task performance, and task performance is always a proxy. The gap between proxy and target may be narrow or wide, and we currently lack the tools to determine which.

The production of benchmarks in NLP has outpaced the production of theory. This is an inversion of what empirical science requires. Good measurement is downstream of good theory; in NLP, measurement has substituted for theory.

What Machines Have and Have Not Demonstrated

The empiricist's obligation is to separate what the data shows from what advocates claim. The data shows: large language models can produce outputs indistinguishable from human-generated text across a wide range of tasks; they can perform translation, summarization, question answering, and code generation at levels useful for practical purposes; they exhibit systematic failures on tasks requiring multi-step logical reasoning, precise counting, and reliable factual recall.

The data does not show: that these systems understand language in any sense that would satisfy a philosophy-of-language account of understanding; that their performance generalizes reliably to distributions outside their training data; that scaling alone will resolve the systematic failures rather than merely delaying them.

The honest assessment is that NLP has produced remarkable engineering achievements on a theoretical foundation that remains inadequate. The field builds machines that process language at human scale without a settled account of what it means to process language at all. That this situation persists, and that the machines continue to improve despite it, is itself a fact about the relationship between theory and engineering that deserves more scrutiny than the field has given it.

The persistent assumption that benchmark saturation constitutes theoretical progress is the central self-deception of modern NLP. A field that cannot distinguish statistical pattern-matching from semantic understanding has not yet explained what its machines are doing — only that they are doing something impressive.