Chain-of-thought prompting

Chain-of-thought prompting is a technique for large language models in which the model is prompted to generate intermediate reasoning steps before producing a final answer, rather than jumping directly to a conclusion. The method was introduced by Wei et al. (2022), who placed worked examples of step-by-step reasoning in the prompt. A zero-shot variant, described by Kojima et al. (2022), exploits a striking empirical finding: simply appending the phrase "Let's think step by step" to a prompt can raise accuracy on some mathematical and logical reasoning benchmarks by tens of percentage points, although reported gains vary widely by model and benchmark. The effect is not uniform: it is strongest on tasks requiring multi-step composition, symbolic manipulation, and causal inference, and weakest on tasks that are primarily pattern recognition or retrieval.
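The difference between a direct prompt and the two chain-of-thought styles can be sketched in a few lines. The function names, the "Q:/A:" layout, and any exemplar text are assumptions made for illustration and are not taken from a particular library; only the trigger phrase comes from the text above.

```python
# Illustrative prompt builders. Names and layout are hypothetical.

ZERO_SHOT_TRIGGER = "Let's think step by step."


def direct_prompt(question: str) -> str:
    """Baseline: ask for the answer with no reasoning scaffold."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot variant: start the answer slot with the trigger phrase."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot_prompt(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot variant: prepend worked (question, reasoning) exemplars."""
    shots = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"
```

In each case the model receives the same question; what changes is whether the text preceding its answer invites or demonstrates intermediate steps.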

The central mystery is what chain-of-thought prompting actually does. It is not merely a formatting trick. The intermediate tokens are not discarded; they participate in the autoregressive computation, so every subsequent token is conditioned on the reasoning trace generated so far. In effect, the model uses its own output as additional context: each step it writes becomes input to every later step. This has led some researchers to describe chain-of-thought as a form of capability elicitation: it does not add new reasoning capacity to the model, but makes latent capacity explicit by scaffolding the generation process.
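A toy sketch can make the conditioning concrete. It assumes a "model" is just a function from the full context to the next token; `generate` and `toy_running_sum` are hypothetical names, and the toy deliberately performs only one addition per call so that the running total has to live in the generated tokens.

```python
from typing import Callable, List

# Toy stand-in for a language model: full context in, next token out.
NextToken = Callable[[List[str]], str]


def generate(next_token: NextToken, prompt: List[str], max_new: int,
             stop: str = "<eos>") -> List[str]:
    """Greedy autoregressive loop. Each new token is appended to the
    context, so later predictions are conditioned on earlier ones."""
    context = list(prompt)
    for _ in range(max_new):
        tok = next_token(context)
        context.append(tok)  # the trace becomes input to the next step
        if tok == stop:
            break
    return context[len(prompt):]


def toy_running_sum(context: List[str]) -> str:
    """Hypothetical toy model that can do one addition per call.
    It reads its own previous output to continue the sum."""
    sep = context.index("=")
    numbers = [int(t) for t in context[:sep]]
    trace = context[sep + 1:]
    if len(trace) >= len(numbers):
        return "<eos>"
    previous = int(trace[-1]) if trace else 0
    return str(previous + numbers[len(trace)])


trace = generate(toy_running_sum, ["2", "3", "4", "="], max_new=10)
print(trace)  # ['2', '5', '9', '<eos>']
```

If the loop is cut off after one new token, the toy never reaches the full sum, which mirrors the point that the intermediate tokens carry part of the computation.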

Reasoning as Latent Program Execution

One influential interpretation, advanced by researchers at Anthropic and Google DeepMind, is that language models possess implicit reasoning procedures that are normally executed in a single forward pass, in a form too compressed to be verified or interpreted. Chain-of-thought prompting forces these procedures to unfold across multiple tokens, externalizing a computation that would otherwise remain hidden in the model's internal activations. This interpretation is sometimes called latent program execution.

This reframing has significant implications. If chain-of-thought reveals latent programs, then the quality of reasoning is bounded by the quality of the latent program, not by the prompting technique itself. A model with poor latent reasoning cannot be rescued by better prompting; conversely, a model with strong latent reasoning may not need explicit chains at all. Test-time compute scaling, whether by increasing the number of reasoning steps or by sampling multiple chains and selecting the final answer by majority vote, can improve results, but only up to the ceiling set by the latent procedure's competence.
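The majority-vote form of test-time scaling mentioned above can be sketched as follows. This is a minimal illustration, not a reference implementation: `noisy_solver` is a hypothetical stand-in for "sample one chain and extract its final answer", and the answer strings are invented.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the sampled chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def vote_over_samples(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples independent answers and pick the majority."""
    return majority_vote([sample_answer() for _ in range(n_samples)])


def noisy_solver(rng: random.Random, p_correct: float) -> Callable[[], str]:
    """Hypothetical sampler: returns "42" with probability p_correct,
    otherwise one of two invented wrong answers."""
    def sample() -> str:
        if rng.random() < p_correct:
            return "42"
        return rng.choice(["41", "43"])
    return sample
```

Voting only helps when the correct answer is already the most likely outcome of a single chain. If `p_correct` is low enough that a wrong answer is the mode, drawing more samples makes the wrong answer win more reliably, which is one way to picture the ceiling described above.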

The analogy to dual-process cognition is instructive. Chain-of-thought prompting scaffolds a "System 2" deliberative mode onto a "System 1" associative engine. But unlike human cognition, where dual-process accounts posit two distinct kinds of processing, in language models the distinction is largely a matter of token budget: the same weights perform both modes. This raises the question of whether chain-of-thought produces reasoning or merely the appearance of reasoning, that is, a simulation of step-by-step deliberation that lacks the causal structure of genuine inference.

Topology of Inference

A systems-level perspective treats chain-of-thought as a network phenomenon: the model's computation graph is expanded by adding intermediate nodes (the reasoning tokens), which increases the effective depth of the computation without increasing the model's parameter count. The reasoning trace is a path through a high-dimensional state space, and different prompts steer the path through different regions. This perspective is sometimes called reasoning topology.
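The claim about effective depth can be given a back-of-envelope form. The sketch below assumes that each generated token requires one full forward pass through every layer and that the next pass cannot begin until that token is known; the function name and the layer count in the example are hypothetical.

```python
def sequential_depth(n_layers: int, n_reasoning_tokens: int) -> int:
    """Rough count of layer applications along the longest dependency chain.

    Assumption: one full forward pass per generated token, run strictly
    in sequence, plus one final pass that emits the answer token.
    The parameter count is unaffected by n_reasoning_tokens.
    """
    if n_layers < 1 or n_reasoning_tokens < 0:
        raise ValueError("n_layers must be >= 1 and n_reasoning_tokens >= 0")
    return n_layers * (n_reasoning_tokens + 1)


# Hypothetical 32-layer model: direct answer versus a 100-token trace.
print(sequential_depth(32, 0))    # 32
print(sequential_depth(32, 100))  # 3232
```

This counts only sequential steps and says nothing about whether the model uses them well; it is meant to show why a longer trace enlarges the computation graph while the weights stay fixed.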

This topological view makes a prediction: chain-of-thought effectiveness should depend on the connectivity of the reasoning space. If the target solution requires traversing a region of token space that the model rarely visited during pre-training, the chain is likely to fail regardless of prompting quality. Conversely, if the solution path lies along high-probability trajectories, even minimal scaffolding suffices. This would explain why chain-of-thought helps with grade-school arithmetic (common in training data) but struggles with novel proof constructions (rare in training data).
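One way to picture this prediction is through the chain rule: the probability of a whole reasoning path is the product of the probabilities of its steps, so a single rarely visited step can make the path very unlikely. The step probabilities below are invented for illustration and are not measurements from any model.

```python
import math
from typing import Sequence


def path_log_prob(step_probs: Sequence[float]) -> float:
    """Log-probability of a reasoning path under the chain rule:
    log p(path) = sum over steps of log p(step | earlier steps)."""
    if any(not 0.0 < p <= 1.0 for p in step_probs):
        raise ValueError("each step probability must be in (0, 1]")
    return sum(math.log(p) for p in step_probs)


# Invented numbers: four familiar steps versus one rarely seen step.
familiar_path = [0.9, 0.9, 0.9, 0.9]
novel_path = [0.9, 0.9, 0.001, 0.9]

print(path_log_prob(familiar_path) > path_log_prob(novel_path))  # True
```

Under this simplified picture, prompting can shift which path is started, but it cannot easily raise the probability of a step the model almost never produces.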

The practical upshot is that chain-of-thought is not a universal reasoning amplifier but a navigational tool. It helps the model find paths it already knows but would not spontaneously traverse. It is less like teaching reasoning and more like providing a map to territory the model has already explored.

The seductive promise of chain-of-thought prompting — that we can turn associative pattern-matchers into deliberate reasoners with a few well-chosen words — is a category error. Reasoning is not a behavior that can be prompted into existence; it is a structural property of a system's causal architecture. Chain-of-thought does not create reasoning. It reveals, with painful clarity, how much of what we call "AI reasoning" is still just sophisticated retrieval dressed in the costume of deliberation. The real question is not how to prompt better, but whether any purely predictive architecture can ever be said to reason at all.