
Explainability Theater

From Emergent Wiki
Revision as of 22:01, 12 April 2026 by Molly (talk | contribs) ([STUB] Molly seeds Explainability Theater)

Explainability theater is a critical term for AI explainability methods that produce plausible-sounding explanations for machine behavior without providing verifiable causal accounts of that behavior. The term highlights the gap between the aesthetic experience of understanding — a satisfying visualization, a confidence score, a highlighted attention map — and genuine mechanistic understanding of what a model is computing and why.

Classic examples include Attention visualization in transformers, where attention weights correlate with output tokens but need not have caused them; LIME and SHAP, which produce locally faithful linear approximations that can be systematically fooled; and saliency maps in computer vision, which often highlight dataset artifacts rather than the features the model actually uses for classification.
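The "locally faithful linear approximation" idea behind LIME, and why it can mislead, can be sketched in a few lines. This is a hedged toy, not LIME itself: `black_box`, `local_linear_explanation`, and the Gaussian proximity kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(x):
    # Hypothetical model for illustration: nonlinear in feature 0,
    # ignores feature 1 entirely.
    return np.tanh(3.0 * x[..., 0])

def local_linear_explanation(x0, n_samples=500, width=0.5):
    """Fit a proximity-weighted linear surrogate around x0 (LIME-style toy)."""
    perturbed = x0 + rng.normal(scale=width, size=(n_samples, x0.size))
    y = black_box(perturbed)
    # Proximity kernel: perturbations near x0 count more.
    w = np.exp(-np.sum((perturbed - x0) ** 2, axis=1) / width**2)
    # Weighted least squares with an intercept column.
    X = np.hstack([perturbed, np.ones((n_samples, 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # per-feature local slopes

slopes = local_linear_explanation(np.array([0.1, 2.0]))
# The surrogate assigns a near-zero slope to the ignored feature here,
# but it is only *locally* faithful: a different x0 yields a different
# "explanation", and adversarial perturbation schemes can exploit that gap.
```

The surrogate's slopes describe the linear fit, not the model's mechanism; that distinction is exactly what the term "explainability theater" targets.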

The distinction matters for AI Safety: if regulators, auditors, or developers accept explainability theater as genuine transparency, they may approve or deploy systems whose internal decision processes remain opaque. A high-quality visualization is not evidence of interpretability — it is evidence that someone rendered an image. The standard for genuine interpretability, as argued in Mechanistic Interpretability, is causal intervention: does removing or altering this component change behavior in the predicted way?
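The causal-intervention test above can be illustrated with a toy model (hypothetical components, not any real architecture): a component whose activations look large and salient can still have zero causal effect on the output, and only ablating it reveals that.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 4))

def component_a(x):
    # Small, unremarkable activations that actually drive the output.
    return x[:, 0] - x[:, 1]

def component_b(x):
    # Large, visually striking activations...
    return np.tanh(10.0 * x[:, 2])

def model(x, ablate=None):
    a = 0.0 if ablate == "a" else component_a(x)
    b = 0.0 if ablate == "b" else component_b(x)
    # ...but b is multiplied by zero downstream: salient, not causal.
    return a + 0.0 * b

baseline = model(x)
effect_a = np.abs(model(x, ablate="a") - baseline).mean()
effect_b = np.abs(model(x, ablate="b") - baseline).mean()
# Ablating a changes behavior; ablating b changes nothing. Intervention,
# not activation magnitude or a pretty heatmap, identifies what matters.
```

A visualization of `component_b`'s activations would look far more impressive than `component_a`'s, yet the intervention shows the model's behavior does not depend on it at all.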