Model Selection

Model selection is the problem of choosing among a set of candidate models — mathematical descriptions of data-generating processes — the one that best balances fidelity to observed data against complexity, interpretability, and generalization to unseen cases. The problem is not merely technical. It is the point where statistical methodology, philosophy of science, and computational practice converge, because the choice of model is never fully determined by the data. The data underdetermine the model, and the principles used to resolve this underdetermination reveal the chooser's commitments about what good explanations look like.

The classical frameworks for model selection include penalized likelihood methods, chiefly the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which add complexity penalties to the maximized log-likelihood to discourage overfitting. For a model with k free parameters, maximized likelihood L, and n observations, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. AIC estimates, up to a constant shared by all candidates, the expected Kullback-Leibler divergence of the fitted model from the true distribution; BIC is a large-sample approximation to minus twice the log marginal likelihood, corresponding to a unit-information prior. The two criteria often select different models, and the choice between them encodes a philosophical divide: AIC is asymptotically efficient, approaching the best achievable prediction error even when the true model is not in the candidate set, while BIC is consistent, selecting the true model with probability tending to one when it is present. Whether one cares more about prediction or truth is not a question the data can answer.
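The two criteria can be made concrete with a small sketch. It assumes Gaussian noise, synthetic data, and two hypothetical candidates (a constant mean and a straight line), none of which come from the text above; it uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, with k counting the noise variance as a parameter.

```python
import math
import random

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, with the variance set to its MLE, RSS / n."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik

def residuals_constant(xs, ys):
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]

def residuals_line(xs, ys):
    # Closed-form ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Synthetic data (an assumption for illustration): a line plus Gaussian noise.
rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
n = len(xs)

# k counts every free parameter, including the noise variance.
candidates = {
    "constant": (residuals_constant(xs, ys), 2),
    "line": (residuals_line(xs, ys), 3),
}

scores = {}
for name, (res, k) in candidates.items():
    ll = gaussian_log_likelihood(res)
    scores[name] = {"aic": aic(ll, k), "bic": bic(ll, k, n)}
    print(f"{name:9s} AIC={scores[name]['aic']:.2f}  BIC={scores[name]['bic']:.2f}")
```

The only difference between the two scores is the penalty: BIC charges ln n per parameter where AIC charges 2, so BIC is the stricter criterion once n exceeds about 7.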

Prediction, Truth, and the Two Cultures

The divide between prediction-oriented and truth-oriented model selection mirrors a deeper split in the philosophy of inference. The machine learning tradition, exemplified by cross-validation and regularization methods like LASSO, treats models as predictive instruments. A model is good if it generalizes to new data, and the internal structure of the model matters only instrumentally. The Bayesian tradition, by contrast, treats model selection as a problem of belief revision: compute the posterior probability of each model given the data, and choose the most probable. Those posterior probabilities depend, through the marginal likelihood, on the prior placed on each model's parameters. The Jeffreys prior and other objective Bayesian methods enter here as attempts to make these priors neutral and defensible.
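The two attitudes can be set side by side in a short sketch. Everything in it is an illustrative assumption: the synthetic data, the two hypothetical candidates, the choice of 5 folds, and in particular the use of exp(-BIC/2) as a stand-in for the marginal likelihood with equal prior model probabilities. A full Bayesian treatment would integrate over each model's parameter prior instead.

```python
import math
import random

def fit_constant(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    # Closed-form ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def kfold_mse(xs, ys, fit, k=5, seed=0):
    """Predictive view: mean squared error on held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return total / len(xs)

def posterior_from_bic(bics):
    """Belief-revision view: approximate posterior model probabilities.

    Assumes equal prior probability for every model and uses exp(-BIC / 2)
    as a rough proxy for each marginal likelihood.
    """
    best = min(bics)  # subtract the minimum for numerical stability
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Synthetic data (an assumption for illustration): a line plus Gaussian noise.
rng = random.Random(1)
xs = [i / 2 for i in range(40)]
ys = [1.0 + 0.8 * x + rng.gauss(0.0, 1.0) for x in xs]

cv_constant = kfold_mse(xs, ys, fit_constant)
cv_line = kfold_mse(xs, ys, fit_line)
print(f"held-out MSE: constant={cv_constant:.2f}, line={cv_line:.2f}")

# Hypothetical BIC values for two models, to show the conversion only.
print(posterior_from_bic([100.0, 110.0]))
```

Cross-validation never asks whether either candidate is true; it only ranks them by held-out error. The posterior weights, by contrast, are probabilities conditional on one of the listed candidates being correct.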

Both traditions have blind spots. The predictive tradition can select models that fit well but explain nothing: black boxes that interpolate without illuminating mechanism. The Bayesian tradition can be undermined by model misspecification: if the true model is not among the candidates, the posterior probabilities are probabilities over a fiction, and the most probable fiction is still a fiction. The minimum description length (MDL) principle, rooted in Kolmogorov complexity and information theory, attempts a synthesis: the best model is the one that compresses the data most efficiently, where compression is measured by the length of the description of the model plus the length of the data encoded with the model. Because Kolmogorov complexity itself is uncomputable, practical MDL works with restricted, computable families of codes. MDL treats model selection as a coding problem, and in doing so it shows that the trade-off between simplicity and fit is not merely a statistical heuristic: it follows from the correspondence between probability distributions and code lengths, under which a short total description requires both a cheap model and a good fit.
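A crude two-part code illustrates the MDL idea of model cost plus data cost. The sketch below is an assumption-laden toy, not a refined MDL procedure: it compares a fair-coin model with no parameters against a biased-coin model with one fitted parameter, and it charges the common asymptotic approximation of (1/2) log2 n bits for stating that parameter. The function names and the example sequences are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli(p) source, in bits per symbol."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_length_fair(bits):
    # No parameters to describe; each symbol costs exactly one bit.
    return float(len(bits))

def code_length_biased(bits):
    # Two-part code: cost of stating the fitted bias, plus cost of the
    # data encoded with that bias.
    n = len(bits)
    p_hat = sum(bits) / n
    model_cost = 0.5 * math.log2(n)      # assumed asymptotic parameter cost
    data_cost = n * binary_entropy(p_hat)
    return model_cost + data_cost

# Only the count of ones matters to these two models.
skewed = [1] * 90 + [0] * 10
balanced = [0, 1] * 50

for name, seq in [("skewed", skewed), ("balanced", balanced)]:
    fair = code_length_fair(seq)
    biased = code_length_biased(seq)
    winner = "biased" if biased < fair else "fair"
    print(f"{name:8s} fair={fair:.1f} bits  biased={biased:.1f} bits  -> {winner}")
```

On the skewed sequence the extra parameter pays for itself through a shorter data code; on the balanced sequence it buys nothing and the simpler model yields the shorter total description.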

Model Selection and the Structure of Science

In practice, model selection is rarely a single decision. It is an iterative process of refinement, expansion, and revision, and in that respect it resembles a formalized version of what philosophers of science describe as scientific method. The choice of which variables to include, which functional forms to consider, and which interactions to allow is shaped by background knowledge, disciplinary conventions, and computational constraints. A neural network with millions of parameters and a linear regression with two parameters are not competitors in any meaningful sense; they inhabit different regions of model space, separated by assumptions about smoothness, locality, and compositionality that are neither derivable from the data nor fully subjective.

The Fisher information matrix enters model selection through the geometry of the model manifold, on which it acts as a local metric. Directions in parameter space with high Fisher information are those in which small parameter changes alter the predicted distribution appreciably; where the Fisher information is close to zero in some directions, the corresponding parameter combinations are poorly identified by the data and the model is effectively lower-dimensional than its nominal parameter count suggests. This geometric perspective reveals that model selection is not a choice among discrete alternatives but a navigation of a continuous space of descriptions, where the boundaries between 'different models' are themselves conventional.
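The notion of an effectively lower-dimensional model can be illustrated with a deliberately simple case. The sketch assumes a straight-line model y = a + b x with Gaussian noise of known standard deviation sigma, for which the Fisher information matrix for (a, b) is (1 / sigma^2) times [[n, sum x], [sum x, sum x^2]]. The two designs and the helper names are hypothetical.

```python
import math

def fisher_information_line(xs, sigma=1.0):
    """Fisher information for (intercept, slope) under Gaussian noise
    with known standard deviation sigma."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    s2 = sigma ** 2
    return [[n / s2, sx / s2], [sx / s2, sxx / s2]]

def eigenvalues_sym2(m):
    """Eigenvalues (small, large) of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    trace = a + d
    det = a * d - b * b
    large = 0.5 * (trace + math.sqrt(max(trace * trace - 4 * det, 0.0)))
    small = det / large  # avoids cancellation when small << large
    return small, large

# Design 1: inputs spread around zero, so both parameters are well determined.
spread = [float(i) for i in range(-5, 6)]
# Design 2: inputs clustered near x = 10, so intercept and slope trade off
# against each other almost perfectly.
clustered = [10.0 + 0.01 * i for i in range(-5, 6)]

ratios = {}
for name, xs in [("spread", spread), ("clustered", clustered)]:
    small, large = eigenvalues_sym2(fisher_information_line(xs))
    ratios[name] = large / small
    print(f"{name:9s} eigenvalues: {small:.3g}, {large:.3g}  ratio: {ratios[name]:.3g}")
```

With the clustered design one eigenvalue is smaller than the other by many orders of magnitude: the data constrain a single combination of intercept and slope, and the nominally two-parameter model behaves like a one-parameter one.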

Model selection is where the pretense of objectivity in statistical practice is most transparently exposed. Every criterion — AIC, BIC, MDL, cross-validation — encodes a value judgment about whether prediction, truth, compression, or stability matters most. The fantasy that the data will tell us which model to choose is the last refuge of methodological innocence. The data speak, but they speak in a grammar we provide, and the grammar is never neutral.