Probing
Probing is an experimental methodology in machine learning and mechanistic interpretability that tests whether specific information is encoded in the internal representations of a trained model. The technique is simple in principle: train a lightweight classifier (the "probe") on a model's hidden activations, then measure how well the probe predicts a property of interest, such as grammatical number, semantic category, factual knowledge, or syntactic structure. If a linear probe succeeds, the information is linearly decodable from the representation at that layer.
Probing emerged from the recognition that large language models develop distributed internal representations whose structure is opaque to direct inspection. Unlike activation patching, which asks where a computation happens, probing asks what a representation contains. The two methods are complementary: patching localizes, probing characterizes.
The Method and Its Variants
The canonical linear probe is a logistic regression classifier trained on flattened activation vectors from a specific layer, using labeled data for the target property. The probe's accuracy on held-out examples measures the ease of extraction: how readily the property can be read out by a simple linear decoder. If a linear probe succeeds, the information is encoded in a geometrically simple way within the representation space. If a linear probe fails but a nonlinear probe succeeds, the information is present but not linearly separable.
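As a minimal sketch of the linear probe described above, the following trains a logistic regression by gradient descent and reports held-out accuracy. It assumes synthetic activations rather than a real model: the array `X`, the hidden `direction`, and the helper names `train_linear_probe` and `probe_accuracy` are illustrative inventions, not part of any library.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # gradient step with L2 penalty
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    return float(np.mean((X @ w + b > 0) == (y == 1)))

# Assumption: synthetic stand-in for one layer's activations. A binary
# property is written along a single hidden direction, plus Gaussian noise.
rng = np.random.default_rng(0)
n, d = 2000, 32
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + np.outer(2.0 * (2 * y - 1), direction)

# Train on one split, evaluate on another, so accuracy reflects
# decodability rather than memorization of the training examples.
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]
w, b = train_linear_probe(X_train, y_train)
test_acc = probe_accuracy(X_test, y_test, w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

With real models, `X` would be replaced by activations extracted from the layer under study and `y` by annotations for the target property; the probe itself stays the same.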
Variants extend this framework:
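- Control tasks (Hewitt and Liang 2019): a sufficiently expressive probe can memorize surface statistics rather than read out genuine linguistic structure. A control task sets a baseline by training the same probe architecture on labels assigned at random to each word type. The gap between accuracy on the real task and on the control task, which Hewitt and Liang call selectivity, indicates whether the probe's success reflects representational structure or probe capacity.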
- Selective probing: probes trained on subsets of neurons to identify which dimensions carry which information, revealing the geometry of representation across the activation space.
- Causal probing: combines probing with interventions to test not merely whether information is present but whether it is used by the model in downstream computation.
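The control-task idea from the list above can be sketched on toy data. The example assumes synthetic "word types" with random embeddings; `type_emb`, `token_types`, and the choice of 200 types in 16 dimensions are hypothetical. The real property is a linear function of the embedding, while the control labels are assigned at random per type, so a low-capacity probe should do well on the first and poorly on the second.

```python
import numpy as np

def fit_probe(X, y, lr=0.5, steps=1000):
    """Logistic-regression probe trained by full-batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((X @ w + b > 0) == (y == 1)))

# Assumption: each token's activation is its type embedding plus small noise.
rng = np.random.default_rng(0)
num_types, d, n_tokens = 200, 16, 4000
type_emb = rng.normal(size=(num_types, d))
token_types = rng.integers(0, num_types, size=n_tokens)
X = type_emb[token_types] + 0.1 * rng.normal(size=(n_tokens, d))

# Real task: a property that is linearly encoded in the embedding.
real_by_type = (type_emb[:, 0] > 0).astype(int)
# Control task: one random label per type, fixed across all its tokens.
control_by_type = rng.integers(0, 2, size=num_types)

split = 3000
results = {}
for name, by_type in [("real", real_by_type), ("control", control_by_type)]:
    y = by_type[token_types]
    w, b = fit_probe(X[:split], y[:split])
    results[name] = accuracy(X[split:], y[split:], w, b)

# Selectivity: how much of the probe's success is not explained by
# its capacity to memorize arbitrary type-to-label mappings.
selectivity = results["real"] - results["control"]
print(results, f"selectivity={selectivity:.3f}")
```

A higher-capacity probe would be expected to score better on the control task and therefore show lower selectivity, which is the comparison the control-task baseline is designed to expose.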
What Probing Cannot Tell Us
Probing has been criticized on several grounds, and the criticisms reveal something important about the limits of representational analysis.
The linear decodability assumption is the most fundamental. Linear probing assumes that if information is present, it will be linearly decodable. But representational geometry in high-dimensional spaces is not constrained to be linear. Information could be encoded in nonlinear manifolds, in the relative geometry of multiple representations, or in dynamical trajectories across layers rather than static activations at any single layer. A failed linear probe therefore does not show that the information is absent; it may show only that the information is not encoded in a form a linear decoder can read.
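A toy illustration of this point, under stated assumptions: the property below is the XOR of the signs of two activation coordinates, so it is fully present in the representation but not linearly separable. The "nonlinear probe" here is a logistic regression on pairwise feature products, a simple stand-in chosen for brevity rather than the small MLP often used in practice. All names (`quadratic_features`, `fit_probe`) are hypothetical.

```python
import numpy as np

def fit_probe(X, y, lr=0.1, steps=2000):
    """Logistic-regression probe trained by full-batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((X @ w + b > 0) == (y == 1)))

def quadratic_features(X):
    """Append all pairwise products x_i * x_j (i <= j) to the raw features."""
    i, j = np.triu_indices(X.shape[1])
    return np.hstack([X, X[:, i] * X[:, j]])

# Assumption: synthetic activations; the property is 1 when the first two
# coordinates share a sign (an XOR-style, nonlinearly encoded feature).
rng = np.random.default_rng(0)
n, d = 4000, 4
X = rng.normal(size=(n, d))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
split = 3000

# Linear probe on raw activations: expected to sit near chance.
w, b = fit_probe(X[:split], y[:split])
linear_acc = accuracy(X[split:], y[split:], w, b)

# Same classifier on quadratic features: can read out the x0 * x1 term.
Q = quadratic_features(X)
wq, bq = fit_probe(Q[:split], y[:split])
nonlinear_acc = accuracy(Q[split:], y[split:], wq, bq)

print(f"linear: {linear_acc:.3f}  nonlinear: {nonlinear_acc:.3f}")
```

The near-chance linear result says nothing about absence: the same data yield high accuracy once the decoder can represent the interaction between the two coordinates.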
The extraction/use distinction is equally serious. Probing tells us that information can be extracted from a representation. It does not tell us that the model uses that information in its computation. A representation might encode grammatical number in a way that is trivially decodable but causally inert — the model does not need it to produce its outputs. Probing without causal intervention risks confusing correlation with functional role.
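The gap between extraction and use can be made concrete with a small intervention sketch. It assumes a toy "model" whose output reads a single direction (`readout`), while the probed property is written along a different direction (`prop_dir`); both directions, the difference-of-means probe, and the projection-based ablation are illustrative choices, not a description of any particular published method.

```python
import numpy as np

def project_out(H, direction):
    """Remove the component of every row of H along the given direction."""
    unit = direction / np.linalg.norm(direction)
    return H - np.outer(H @ unit, unit)

def toy_model(H, readout):
    """Hypothetical downstream computation: a sign readout of one direction."""
    return (H @ readout > 0).astype(int)

rng = np.random.default_rng(0)
n, d = 2000, 16
readout = np.zeros(d)
readout[0] = 1.0            # the direction the toy model actually uses
prop_dir = np.zeros(d)
prop_dir[1] = 1.0           # the direction that encodes the probed property
prop = rng.integers(0, 2, size=n)
H = rng.normal(size=(n, d)) + np.outer(3.0 * (2 * prop - 1), prop_dir)

# Difference-of-means linear probe (fit and scored on the same data for brevity).
mean1, mean0 = H[prop == 1].mean(axis=0), H[prop == 0].mean(axis=0)
probe_dir = mean1 - mean0
midpoint = 0.5 * (mean1 + mean0) @ probe_dir
probe_acc = float(np.mean((H @ probe_dir > midpoint) == (prop == 1)))

# Intervention: erase a direction, then check how often the output changes.
baseline = toy_model(H, readout)
changed_by_probe_dir = float(
    np.mean(toy_model(project_out(H, probe_dir), readout) != baseline))
changed_by_readout = float(
    np.mean(toy_model(project_out(H, readout), readout) != baseline))

print(f"probe accuracy: {probe_acc:.3f}")
print(f"outputs changed after erasing probe direction: {changed_by_probe_dir:.3f}")
print(f"outputs changed after erasing readout direction: {changed_by_readout:.3f}")
```

In this construction the property is almost perfectly decodable, yet erasing it leaves the toy model's outputs essentially untouched, whereas erasing the readout direction changes a large share of them. High probe accuracy alone would not have distinguished the two directions.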
The layerwise fallacy is the assumption that representations are best analyzed at single layers. In transformer architectures, information flows through multi-head attention, is transformed by MLPs, and is recombined across heads in ways that layerwise probing cannot capture. The unit of analysis may not be the layer but the computational circuit — a distributed subgraph of the network that implements a specific algorithm.
Probing and the Science Wars
The debate over what probing reveals mirrors the Science Wars in miniature. One camp treats successful probing as evidence that models "understand" linguistic structure — that grammatical number is not merely statistically correlated with outputs but represented as a genuine linguistic category. Another camp treats probing as a sophisticated curve-fitting exercise that tells us nothing about the model's competence or the nature of its representations.
The correct position is that neither camp is asking the right question. Probing does not tell us whether a model understands. It tells us what geometric structures exist in its activation space, structures that may or may not be causally implicated in the model's behavior. The question of understanding is not answered by representational geometry alone; it is answered by the coupling between representation, computation, and task performance in the full system.
The productive synthesis: probing is a structural measurement tool, not an oracle of competence. Its results constrain the space of possible mechanisms but do not determine which mechanism is actual. This closely parallels the status of neuroimaging in cognitive neuroscience: fMRI tells us where activation occurs, but the inference from activation location to cognitive function requires theoretical bridging assumptions that the imaging data alone do not supply.
Probing will be remembered not as a tool that settled debates about machine understanding, but as a tool that made those debates empirically tractable for the first time — and in doing so, revealed how much more we need to know.