Feature selection

Feature selection is the process of choosing a subset of relevant features from a larger set of variables in a dataset, with the goal of improving model performance, reducing computational cost, and increasing interpretability. It can be read as the machine learning analogue of filtering in signal processing: where a signal filter removes frequency components deemed irrelevant, a feature selection algorithm removes dimensions of data deemed uninformative. The logic of selective attenuation carries over, but the domain is variables rather than frequencies, and the criterion is predictive relevance rather than spectral power.

Methods

Feature selection methods fall into three categories, each representing a different philosophy about what counts as 'relevance.'

Filter methods evaluate features independently of any learning algorithm, using statistical measures such as correlation, mutual information, or chi-squared tests. A filter method treats feature selection as a pre-processing step: in its common univariate form, it ranks features by their individual relationship to the target variable and selects the top k. This is the conservative approach, analogous to the Butterworth filter: it preserves as much information as possible while removing obvious noise, but because each feature is scored on its own, it cannot detect interactions between features.
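As a minimal sketch of a univariate filter, the snippet below ranks features by absolute Pearson correlation with the target and keeps the top k. The function names (pearson, filter_top_k) and the toy data are illustrative assumptions, not part of any library. The data are built so that the target is the XOR of x0 and x1, while x2 is a noisy copy of the target.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def filter_top_k(columns, target, k):
    """Score each feature on its own and return the indices of the k best."""
    scores = [abs(pearson(col, target)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

# Hypothetical toy data: the target is x0 XOR x1; x2 is a noisy copy of the target.
x0 = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 1, 0, 1, 0, 1, 0, 1]
x2 = [0, 1, 1, 0, 0, 1, 0, 0]
y = [a ^ b for a, b in zip(x0, x1)]

selected, scores = filter_top_k([x0, x1, x2], y, k=1)
print(selected)                       # [2]
print([round(s, 3) for s in scores])  # [0.0, 0.0, 0.775]
```

On this data the filter keeps only x2. The two features that jointly determine the target each score zero on their own, which is the interaction blind spot of univariate ranking.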

Wrapper methods use a learning algorithm as a black-box evaluator, testing different subsets of features by training and evaluating the model on each subset. A wrapper method treats feature selection as a search problem: it explores the space of feature subsets, guided by the model's performance. Because n features yield 2^n subsets, exhaustive search is feasible only for small n, and practical wrappers rely on heuristics such as greedy forward selection or backward elimination. This is the aggressive approach, analogous to the elliptic filter: it tailors the subset to a given model and can capture interactions between features, but at the cost of computational expense and the risk of overfitting to the evaluation protocol.
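The search can be sketched with a deliberately tiny example. The code below scores every non-empty feature subset by the leave-one-out accuracy of a simple majority-vote lookup classifier, then keeps the best subset. The classifier, the tie-breaking rules, and the toy data (target equal to x0 XOR x1, with x2 a noisy copy of the target) are assumptions chosen for illustration. Exhaustive enumeration is used only because there are three features; it would not scale.

```python
from itertools import combinations

def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a majority-vote lookup classifier on `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        key = tuple(row[j] for j in subset)
        # Labels of the other rows that agree with row i on the chosen features.
        votes = [labels[m] for m, other in enumerate(rows)
                 if m != i and tuple(other[j] for j in subset) == key]
        if not votes:  # unseen combination: fall back to all other rows
            votes = [labels[m] for m in range(len(rows)) if m != i]
        prediction = 1 if 2 * sum(votes) > len(votes) else 0  # ties go to 0
        correct += prediction == labels[i]
    return correct / len(rows)

def wrapper_search(rows, labels):
    """Score every non-empty feature subset; smaller subsets win ties."""
    n_features = len(rows[0])
    best_subset, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = loo_accuracy(rows, labels, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical toy data: binary labels, target is x0 XOR x1, x2 is a noisy proxy.
x0 = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 1, 0, 1, 0, 1, 0, 1]
x2 = [0, 1, 1, 0, 0, 1, 0, 0]
rows = list(zip(x0, x1, x2))
labels = [a ^ b for a, b in zip(x0, x1)]

best_subset, best_score = wrapper_search(rows, labels)
print(best_subset, best_score)  # (0, 1) 1.0
```

Here the search recovers the interacting pair (x0, x1), which no single-feature score would reveal. Note that the subset is chosen on the same leave-one-out score that is reported, so on real data that score is an optimistic estimate; a held-out set would be needed for an honest one.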

Embedded methods perform feature selection during model training, integrating the selection criterion into the learning objective. LASSO regression, for example, penalizes the sum of the absolute values of the coefficients (an L1 penalty), driving some of them exactly to zero and effectively removing the corresponding features. This is the elegant approach, analogous to the Chebyshev filter: it accepts a controlled amount of distortion (bias) in exchange for a simpler model, and the trade-off is governed by a single parameter that can be tuned.
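To make the mechanism concrete, here is a minimal sketch of LASSO fitted by cyclic coordinate descent, one common way to solve it. It assumes the objective (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1, centred data with no intercept, and a hypothetical toy design whose columns are mutually orthogonal so that the result is easy to check by hand. The function names and the parameter name lam are illustrative.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(columns, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.

    Assumes centred data (no intercept) and no all-zero columns.
    """
    n, p = len(y), len(columns)
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: y minus the contribution of every feature except j.
            resid = [y[i] - sum(w[k] * columns[k][i] for k in range(p) if k != j)
                     for i in range(n)]
            rho = sum(columns[j][i] * resid[i] for i in range(n)) / n
            z = sum(v * v for v in columns[j]) / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Hypothetical toy design: orthogonal +/-1 columns; y depends strongly on x0,
# weakly on x2, and not at all on x1.
x0 = [1, 1, 1, 1, -1, -1, -1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [3.0 * a + 0.2 * c for a, c in zip(x0, x2)]

w_small = lasso_coordinate_descent([x0, x1, x2], y, lam=0.1)
w_big = lasso_coordinate_descent([x0, x1, x2], y, lam=0.5)
print([round(v, 3) for v in w_small])  # [2.9, 0.0, 0.1]
print([round(v, 3) for v in w_big])    # [2.5, 0.0, 0.0]
```

With the larger penalty the weak feature x2 is removed entirely, and the surviving coefficient is shrunk from its true value of 3.0 to 2.5. That shrinkage is the bias accepted in exchange for sparsity, and lam is the single parameter that controls the trade-off.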

The Curse of Dimensionality

The need for feature selection arises from the curse of dimensionality: as the number of features grows, the volume of the data space grows exponentially, and the data become sparse. A model trained on high-dimensional data with comparatively few samples tends to overfit (it learns the noise in the training set rather than the underlying pattern) because the space of hypotheses the model can express grows far faster than the data available to rule them out. Feature selection is a dimensionality reduction strategy that preserves the original feature meanings, unlike principal component analysis, which creates new synthetic features. The filter is not a transformation; it is a selection.
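The sparsity argument can be illustrated with simple counting. The sketch below assumes each feature is discretised into a fixed number of bins, so that d features define a grid of bins^d cells, and computes an upper bound on the fraction of cells that a fixed number of samples can occupy. The function name and the specific numbers are illustrative assumptions.

```python
def max_occupied_fraction(n_samples, bins_per_axis, n_features):
    """Upper bound on the fraction of grid cells that can hold at least one sample."""
    n_cells = bins_per_axis ** n_features  # cell count grows exponentially with d
    return min(1.0, n_samples / n_cells)

# Hypothetical setting: 1,000 samples, 10 bins per feature.
for d in (1, 2, 3, 6, 10):
    frac = max_occupied_fraction(n_samples=1000, bins_per_axis=10, n_features=d)
    print(d, 10 ** d, frac)
```

With 1,000 samples and 10 bins per feature, every cell can in principle be covered up to 3 features, but at 6 features no more than 0.1% of the cells can contain any data at all, however the samples are distributed.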

Feature Selection as Epistemology

The philosophical dimension of feature selection is often overlooked. When a data scientist selects features, they are making a claim about what matters in the data. In practice the claim is read as causal, not merely statistical: a feature selected by a wrapper method is treated not just as correlated with the target but as a candidate cause, a variable whose manipulation would change the outcome. But correlation is not causation, and feature selection methods do not distinguish between them. A feature that is merely a proxy for the true cause will still be selected, and the model will perform well until the relationship between proxy and cause breaks down.

This is the epistemic trap of feature selection: the algorithm tells you which features are useful, but it does not tell you why. The wrapper method that selects a feature because it improves accuracy is performing a pragmatic selection, not a structural one. The filter method that selects a feature because it has high mutual information is performing a statistical selection, not a causal one. Neither method can distinguish between a cause and its symptom, and in high-stakes domains such as medicine, finance, and criminal justice, that distinction matters.

Feature selection is not merely a technical step in a machine learning pipeline. It is a theory of what counts as evidence, and like all theories, it is biased. The filter method assumes that individual relevance is sufficient; the wrapper method assumes that model performance is the criterion; the embedded method assumes that sparsity is desirable. None of these assumptions is universally true. The choice of feature selection method is the choice of an epistemological framework, and most practitioners make that choice without knowing they are doing epistemology. The mathematics of feature selection is not neutral; it is a filter that selects not only features but also the kinds of questions that can be asked about them.