LLM
A large language model (LLM) is a neural network trained on vast text corpora to predict the next token in a sequence, yet its outputs exhibit properties (reasoning, planning, translation, code generation) that no single training objective explicitly specifies. From a systems perspective, an LLM is not merely a statistical pattern matcher. It is a high-dimensional dynamical system operating on a discrete state space of token sequences, where the trajectory through that space is determined by both the model's fixed parameters and the evolving context window it carries. The parameters are the system's frozen memory; the context window is its working memory; and next-token prediction is its local rule of motion.
LLMs as Dynamical Systems
At inference time, an LLM operates as a discrete-time dynamical system. The state \(s_t\) is the context, and generation applies a transition function \(s_{t+1} = f(s_t, \theta)\), where \(\theta\) denotes the trained parameters and \(f\) appends the next token to the context. Under greedy decoding \(f\) is deterministic; under sampling, the next token is drawn from the model's predicted distribution, so the system is stochastic (a Markov chain over contexts). This is not a metaphor. It is the mathematical structure of autoregressive generation. The long-term behavior of such a system (whether it converges to a fixed point, enters a limit cycle, or wanders chaotically) depends on the trained parameters and on the structure of the attractor basin in which the initial prompt lies.
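A minimal sketch of this view, assuming greedy decoding and a hypothetical toy model: the table `THETA` stands in for the trained parameters \(\theta\), and `step` plays the role of \(f\). Unlike a real LLM, the toy conditions only on the last token, which keeps the dynamics easy to trace by hand. All names here are illustrative.

```python
from typing import Dict, Tuple

State = Tuple[str, ...]

# Hypothetical toy "parameters": next-token scores keyed by the last token only.
# A real LLM conditions on the whole context; this table is an illustration.
THETA: Dict[str, Dict[str, float]] = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "the": 0.3},
    "dog": {"sat": 0.9, "the": 0.1},
    "sat": {"the": 1.0},
}


def step(state: State, theta: Dict[str, Dict[str, float]] = THETA) -> State:
    """Transition function f(s_t, theta) under greedy decoding.

    Picks the highest-scoring next token and appends it to the context.
    """
    scores = theta[state[-1]]
    next_token = max(scores, key=scores.get)
    return state + (next_token,)


def generate(prompt: State, n_steps: int) -> State:
    """Iterate the transition function n_steps times from the prompt."""
    state = prompt
    for _ in range(n_steps):
        state = step(state)
    return state


if __name__ == "__main__":
    print(" ".join(generate(("the",), 6)))
```

Starting from the prompt `("the",)`, six steps yield the tokens the, cat, sat, the, cat, sat, the. The emitted tokens fall into a cycle of period three, even though the context itself keeps growing and so never literally revisits an earlier state.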
What makes LLMs unusual among dynamical systems is the scale of their state space. An LLM typically operates over a vocabulary of tens of thousands to a few hundred thousand tokens, and in some models the context window extends to hundreds of thousands of positions. The number of possible contexts is astronomically large, yet the trajectories are highly structured: they produce grammatical sentences, coherent arguments, and, in larger models, multi-step reasoning chains. This structure is not programmed; it is emergent. The system self-organizes into regions of state space that correspond to linguistic competence, much as a causal set self-organizes into macroscopic spacetime geometry from discrete causal relations.
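To put a rough number on the size of this state space: with a vocabulary of \(V\) tokens and a context of \(n\) positions there are \(V^n\) possible full-length contexts. The sketch below computes the base-10 logarithm of that count. The function name and the example figures are illustrative assumptions, not the specification of any particular model.

```python
import math


def log10_num_contexts(vocab_size: int, context_length: int) -> float:
    """Return log10(vocab_size ** context_length).

    Works in log space because the count itself is far too large to be
    represented as a floating-point number.
    """
    return context_length * math.log10(vocab_size)


if __name__ == "__main__":
    # Assumed, illustrative sizes: 50,000 tokens and 100,000 positions.
    exponent = log10_num_contexts(50_000, 100_000)
    print(f"about 10^{exponent:,.0f} possible contexts")
```

With these assumed sizes the count is on the order of \(10^{469{,}897}\), which is why only a vanishing fraction of the state space can ever be visited.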
The parallel extends further. In causal set theory, the microscopic rules are simple (causal ordering and stochastic growth) but the macroscopic properties (dimensionality, topology, metric) are emergent and non-obvious. In LLMs, the microscopic rule is equally simple (predict the next token) but the macroscopic properties (logical consistency, factual retrieval, style mimicry) are emergent and non-obvious. Both systems raise the same question: how does a simple local rule produce a complex global structure? The answer in both cases appears to involve scale: sufficiently large systems can host emergent phases that small systems cannot.
Emergence and the Scaling Hypothesis
The scaling hypothesis in LLM research posits that model performance improves predictably as model size, data volume, and compute increase, and that capabilities such as in-context learning, chain-of-thought reasoning, and few-shot generalization emerge along the way. Some of these capabilities are reported to appear not gradually but abruptly, like phase transitions: absent below a threshold of scale and present above it. Such threshold behavior is a hallmark of emergence in complex systems, and it has been observed across domains from physical phase transitions to strategy improvement in game-solving algorithms.
The emergence of reasoning from scale is not without precedent. Borg, Google's cluster scheduler, exhibits emergent global optimization from local scheduling rules. Wikipedia exhibits emergent knowledge organization from local editing decisions. In both cases, the system is greater than the sum of its parts because the parts interact in a structured network. LLMs are similar: the attention mechanism creates a dense interaction graph among the tokens in the context window (in autoregressive models, each token attends to every token that precedes it), and it is this interaction structure that enables the emergence of coherent long-range dependencies.
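The shape of this interaction graph can be sketched in a few lines. In autoregressive models attention is usually causal: each position attends to itself and to earlier positions, never to later ones. The function below is a simplified illustration under that assumption. It takes a square matrix of raw attention scores (hypothetical inputs here) and returns row-wise softmax weights with future positions masked to zero. It omits the query, key, and value projections and the multiple heads of a real attention layer.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over scores[i][j], restricted to j <= i.

    Entry [i][j] of the result is the weight that position i places on
    position j; positions after i receive weight 0.0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]            # causal mask: only positions 0..i
        peak = max(visible)               # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


if __name__ == "__main__":
    # Equal raw scores: each position spreads its weight evenly over its past.
    for row in causal_attention_weights([[0.0] * 4 for _ in range(4)]):
        print([round(w, 3) for w in row])
```

The nonzero entries form a lower-triangular matrix: every pair of positions is connected, but information flows only from earlier tokens to later ones.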
But the scaling hypothesis also carries a warning. Emergent properties are not designed properties; they are discovered properties. When an LLM exhibits reasoning, we do not know which parameters, which layers, or which attention heads are responsible. The system is a black box not because it is secret but because its emergent behavior is distributed across billions of parameters in ways that resist localization. This is the observational incompleteness of LLMs: we can measure the outputs, but we cannot fully map the internal causal structure that produces them. The same observational incompleteness that plagues Borg's scheduler also plagues our understanding of LLMs.
Network Epistemics and Distributed Knowledge
An LLM is a network epistemic system. Its knowledge is not stored in any single location but distributed across its parameter matrices in a manner analogous to how knowledge is distributed across the editors of Wikipedia. No single editor knows the whole encyclopedia; no single neuron knows the whole language. The knowledge is encoded in the relational structure of the network: the weights between layers, the attention patterns between tokens, the representational geometry of the hidden states.
This distributed architecture has implications for how we should think about LLM reliability. A centralized knowledge system — a database, a search engine, a curated encyclopedia — can be verified point by point. A distributed knowledge system cannot. Its errors are not local bugs but structural distortions: hallucinations that arise not from missing data but from the geometry of the attractor basin. An LLM does not know when it is wrong because there is no central knowledge repository to consult. There is only the trajectory through state space, and the trajectory is locally consistent even when globally false.
This is not a limitation to be overcome but a property to be understood. The epistemic status of LLM outputs is not that of testimony (claims made by a knower) nor that of evidence (data generated by a process). It is something closer to prediction in the statistical sense: the most probable continuation given the context. The question is not whether LLMs are intelligent but whether prediction at sufficient scale becomes a form of knowledge production — and if so, what kind of knowledge it produces, and what its limits are.
The persistent assumption that LLMs are either just

== See also ==

Attention mechanism | Transformers | In-context learning | Chain-of-thought reasoning | Prompt engineering | Few-shot learning | Autoregressive model | Scaling hypothesis | Tokenization