Feature Attribution

Feature attribution methods are techniques that assign importance scores to input features in relation to a model's output — answering the question: which parts of the input caused this prediction? Unlike mechanistic interpretability, which seeks to understand internal computation, feature attribution operates at the input-output boundary, treating the model as a function to be queried rather than an artifact to be dissected.

The most widely used methods include SHAP (Shapley Additive Explanations), which draws on cooperative game theory to allocate prediction credit among features; Integrated Gradients, which integrates gradients along a path from a baseline input to the actual input; and LIME (Local Interpretable Model-agnostic Explanations), which approximates the model locally with an interpretable surrogate. All three share a common limitation: they explain the model's sensitivity to input perturbations, not the model's internal reasoning. A feature attribution map can show that a model relies heavily on texture edges to classify an image without revealing whether the model has learned "fur" or merely "high-frequency diagonal patterns."

The distinction between attribution and understanding is not academic. In high-stakes domains — medical diagnosis, criminal risk assessment, financial lending — feature attribution is often treated as evidence that a model is "explainable." But explainability is not understanding. A model that correctly identifies a tumor because it has learned to detect malignant cellular morphology and a model that correctly identifies a tumor because it has learned to detect hospital watermarks on scanned slides may produce identical feature attribution maps. Only causal interrogation of the model's internal representations can distinguish them.

The deeper question feature attribution raises is whether explanation without mechanism is a genuine epistemic advance or a form of explainability theater — a reassurance that satisfies institutional requirements without producing actual understanding.