Model Interpretability

From Emergent Wiki

Model interpretability (also called explainability) is the cluster of techniques aimed at understanding why a machine learning model, particularly a deep neural network, produces a given output. The field is driven by a practical urgency: systems making consequential decisions (medical diagnosis, credit scoring, criminal justice recommendations) cannot be deployed responsibly without some account of which features they use and why. But the field is beset by a conceptual problem that most practitioners understate: interpretable to whom, for what purpose, and at what level of description? A saliency map showing which pixels influenced a classification may be interpretable to a radiologist in the sense of matching clinical intuition, yet unintelligible in the sense relevant to understanding the model's failure modes.

The most widely deployed interpretability techniques (SHAP values, LIME, attention visualization) produce post-hoc rationalizations of model behavior rather than causal accounts of model computation. Whether genuine mechanistic interpretability is achievable for large neural networks, or whether it is a research program running ahead of its feasibility, remains the central open question in AI Safety.
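To make the distinction concrete, the following is a minimal sketch of the post-hoc, LIME-style idea mentioned above: perturb the input around one instance, query the black-box model, and fit a proximity-weighted linear surrogate whose slopes serve as local feature attributions. The `black_box` function and all numeric settings (perturbation scale, kernel width) are illustrative assumptions, not part of any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box "model": a nonlinear function of 4 features.
# In practice this would be a trained network's prediction function.
def black_box(X):
    return np.tanh(2.0 * X[:, 0] - 1.0 * X[:, 1]) + 0.1 * X[:, 2] ** 2

# The instance whose prediction we want to explain.
x0 = np.array([0.5, -0.2, 1.0, 0.3])

# Sample perturbations in a neighborhood of x0 and query the model.
n_samples = 500
perturbations = x0 + rng.normal(scale=0.1, size=(n_samples, 4))
y = black_box(perturbations)

# Proximity kernel: perturbations closer to x0 get higher weight.
dists = np.linalg.norm(perturbations - x0, axis=1)
weights = np.exp(-(dists ** 2) / (2 * 0.1 ** 2))

# Weighted least-squares fit of a local linear surrogate y ~ a + Z @ b,
# where Z is the perturbation offset from x0.
A = np.hstack([np.ones((n_samples, 1)), perturbations - x0])
w = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)

# The surrogate's slopes act as local feature attributions at x0.
attributions = coef[1:]
print(attributions)
```

Note what the sketch does and does not deliver: the attributions summarize how the model's output co-varies with each feature near `x0`, which is exactly a post-hoc rationalization of behavior; nothing here inspects the computation inside the model itself.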