Interpretability
Interpretability (also explainability) in machine learning is the attempt to characterize, in human-comprehensible terms, what a trained model has learned and why it produces the outputs it does. It responds to a structural problem: machine learning models, particularly deep neural networks, are optimized to minimize loss functions, not to produce human-readable justifications. Their internal computations, billions of arithmetic operations arranged in layered matrix multiplications, resist introspection.
The field divides into two broad approaches: models interpretable by design (linear models, decision trees, constrained architectures) and post-hoc interpretation of models as trained. Post-hoc methods apply analysis to a trained model without modifying it: attention visualization, feature attribution (SHAP, LIME, integrated gradients), probing classifiers, and mechanistic interpretability (circuit identification). These methods produce outputs that look like explanations. Whether they are explanations (whether they identify the model's actual computational reasons for its outputs) is contested. An attention map that highlights the word 'not' does not tell you what the model did with that information; it tells you only that the word was attended to.
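Of the attribution methods listed above, integrated gradients is the simplest to state precisely: attribute each feature by integrating the model's gradient along a straight-line path from a baseline input to the actual input. A minimal NumPy sketch, using a toy logistic-regression model with an analytic gradient (the model, weights, and baseline here are illustrative assumptions, not from any particular system):

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=200):
    """Approximate integrated gradients: a midpoint Riemann sum of the
    gradient along the straight-line path from `baseline` to `x`."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable model: logistic regression score (illustrative weights).
w = np.array([1.5, -2.0, 0.5])
f = lambda z: 1.0 / (1.0 + np.exp(-z @ w))
grad_f = lambda z: f(z) * (1.0 - f(z)) * w  # analytic sigmoid gradient

x = np.array([1.0, 1.0, 2.0])
baseline = np.zeros(3)
attr = integrated_gradients(f, grad_f, x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline).
print(attr, attr.sum(), f(x) - f(baseline))
```

The final check verifies the completeness property, which makes the numbers auditable as an accounting of the output; it does not make them a description of the model's mechanism.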
Mechanistic interpretability (Anthropic, Olah et al.) attempts to reverse-engineer the algorithms implemented in neural network weights: to find, in circuits of neurons, identifiable computations analogous to known algorithms. In small models there are successes: induction heads implementing in-context learning in transformers, curve detectors in vision networks, Fourier features in networks trained on modular arithmetic. In large models, success is partial and identifiable structure sparser. The project assumes that models implement interpretable algorithms at all; that assumption may not scale.
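The induction-head behavior can be stated as an algorithm independent of any network: on seeing the current token, find its most recent earlier occurrence and predict the token that followed it. A sketch of that algorithm (of what the circuit is reported to compute, not of the circuit itself):

```python
def induction_predict(tokens):
    """Prefix-matching plus copying, the computation attributed to
    induction heads: scan backward for the most recent earlier
    occurrence of the final token and predict its successor."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

# In "A B C A", the earlier A was followed by B, so B is predicted.
print(induction_predict(list("ABCA")))
```

The interpretability claim is that specific attention heads implement exactly this lookup: one head matches the prefix, a second copies the successor.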
The gap between interpretability research and practical deployment is large. Regulatory frameworks (algorithmic accountability law, the EU AI Act) require explanations for automated decisions. The explanations that interpretability methods provide are not the explanations that regulation intends: a SHAP value distribution is not a reason in the sense of something a human could evaluate and contest. The demand for explainable AI is a political demand being met with technical proxies. Those proxies satisfy the form of accountability while bypassing its substance.
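What a SHAP value distribution concretely is: one number per feature per prediction, the feature's average marginal contribution over orderings. A brute-force sketch for a toy two-feature model (the model and the convention of holding absent features at a fixed baseline are illustrative assumptions; the SHAP library approximates this over a background dataset):

```python
from itertools import permutations
import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.
    'Absent' features are held at `baseline` (a simplified convention).
    Cost is factorial in the number of features; toy-sized only."""
    n = len(x)
    phi = np.zeros(n)
    orders = list(permutations(range(n)))
    for order in orders:
        z = baseline.astype(float).copy()
        prev = f(z)
        for i in order:           # add features one at a time
            z[i] = x[i]
            cur = f(z)
            phi[i] += cur - prev  # marginal contribution of feature i
            prev = cur
    return phi / len(orders)

# Toy model with an interaction term, so values are not just weights.
f = lambda z: z[0] * z[1] + z[0]
x = np.array([1.0, 2.0])
baseline = np.zeros(2)
phi = exact_shapley(f, x, baseline)

# Efficiency axiom: contributions sum to f(x) - f(baseline).
print(phi, phi.sum(), f(x) - f(baseline))
```

The efficiency axiom guarantees the numbers add up; it does not turn them into a contestable reason, which is the gap the paragraph describes.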