Model selection

Model selection is the task of choosing a statistical model from a set of candidate models for a given dataset. The problem is deceptively simple: given data, which model best captures the underlying pattern without capturing the noise? The challenge is that a more complex model can always fit the observed data at least as well as a simpler model nested within it, so the task is not to maximize fit but to balance fit against complexity. This tradeoff has the same structure whether it is framed in terms of predictive accuracy, Bayesian evidence, or descriptive economy.
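The tension between fit and complexity can be seen in a small numerical sketch. The example below is illustrative only: the synthetic quadratic data, the noise level, the polynomial candidate set, and the helper name fit_polynomials are all assumptions introduced here, and the Akaike information criterion (AIC) is used as just one of several possible complexity-penalized scores. It fits polynomials of increasing degree by least squares and records both the training residual sum of squares (RSS) and the AIC for each.

```python
import numpy as np


def fit_polynomials(x, y, max_degree):
    """Return a list of (degree, training RSS, AIC) for degrees 0..max_degree."""
    n = len(x)
    results = []
    for degree in range(max_degree + 1):
        # Design matrix with columns 1, x, x^2, ..., x^degree.
        X = np.vander(x, degree + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ coef) ** 2))
        # Parameter count: polynomial coefficients plus the noise variance.
        k = degree + 2
        # Gaussian least-squares AIC, up to an additive constant.
        aic = n * np.log(rss / n) + 2 * k
        results.append((degree, rss, aic))
    return results


# Synthetic data (an assumption for illustration): a quadratic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

results = fit_polynomials(x, y, max_degree=8)
best_by_aic = min(results, key=lambda r: r[2])[0]

for degree, rss, aic in results:
    print(f"degree={degree}  training RSS={rss:8.3f}  AIC={aic:8.2f}")
print("degree chosen by AIC:", best_by_aic)
```

Because each polynomial family contains the lower-degree ones, the training RSS can only stay the same or shrink as the degree grows, so fit alone would always favor the most complex candidate. The AIC column adds a penalty of 2 per parameter, which is what lets a simpler model win; other criteria penalize complexity differently and may choose differently.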

Model selection is not merely a statistical subroutine. It is an epistemological practice: the choice of what to include in a model and what to exclude from it is the choice of what to regard as signal and what to regard as noise. This is why model selection appears not only in statistics but also in philosophy of science, where it is discussed as the problem of theory choice, and in machine learning, where it appears as architecture search and hyperparameter tuning. The problem is the same in every domain: how to generalize from finite experience without overfitting to the specific accidents of that experience.

The task of model selection is not to find the true model. It is to find a model that is wrong in useful ways — wrong enough to be simple, right enough to be predictive. The belief that there is a single correct model is itself the strongest form of overfitting.