Information gain

Information gain is the reduction in entropy, or more generally in expected uncertainty, achieved by partitioning a dataset according to the values of an attribute. It is the criterion that the ID3 algorithm and its successors use to select which feature to test at each node of a decision tree. The feature that maximizes information gain is the one that, once known, leaves the class distribution within each partition most concentrated on average, and therefore most predictable.
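The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than ID3 itself: the function names (entropy, information_gain) and the toy columns (humidity, windy, play) are assumptions made up for the example, and entropy is measured in bits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the size-weighted entropy of each partition."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy dataset: two candidate features and a binary class label.
humidity = ["high", "high", "high", "high", "normal", "normal", "normal", "normal"]
windy = [True, False, True, False, True, False, True, False]
play = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]

gains = {
    "humidity": information_gain(humidity, play),
    "windy": information_gain(windy, play),
}

# A greedy learner would test the feature with the largest gain at this node.
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

In this toy data, splitting on humidity leaves each partition mostly one class, while splitting on windy leaves both partitions evenly mixed, so a greedy selection picks humidity.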

The concept originates in Claude Shannon's information theory, where entropy measures the average information content of a message source. In machine learning, information gain repurposes this measure to quantify the discriminative power of a feature. A feature that perfectly separates the classes attains the maximum possible information gain, which equals the entropy of the class label; a feature that is statistically independent of the class label has zero information gain. Information gain is the mutual information between the feature and the class label, so the quantity itself is symmetric: a feature tells you exactly as much about the class as the class tells you about the feature. Only its use is directional, since a decision tree always conditions the class on the feature.
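The two extremes described above can be checked numerically. The sketch below is illustrative only: the helper names and the three-class toy data (labels, perfect, independent) are hypothetical, chosen so that one feature determines the class exactly and the other leaves the class distribution unchanged in every partition.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the size-weighted entropy of each partition."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


labels = ["a", "a", "b", "b", "c", "c"]

# Each feature value maps to exactly one class: the split removes all uncertainty.
perfect = [0, 0, 1, 1, 2, 2]

# Each feature value sees the same class mix as the whole dataset: nothing is learned.
independent = [0, 1, 0, 1, 0, 1]

gain_perfect = information_gain(perfect, labels)
gain_independent = information_gain(independent, labels)
print(gain_perfect, entropy(labels), gain_independent)
```

The perfectly separating feature recovers the full label entropy (log2 of 3 bits here, since the three classes are equally frequent), and the independent feature recovers nothing.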

The standard formulation has a known bias toward features with many possible values, because a finer partition tends to produce purer subsets whether or not the feature generalizes. This is why C4.5 replaces raw information gain with gain ratio, a normalized variant that divides the gain by the entropy of the partition itself and so penalizes high-cardinality attributes. But the deeper limitation is conceptual. Used as a greedy split criterion, information gain presumes that some single feature, if tested, will sharpen the prediction. This is not true in domains where the relevant information is distributed across many weak signals, none of which is individually informative. In such regimes, ensemble methods that aggregate many small gains tend to outperform a single greedily grown tree.
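The high-cardinality bias is easy to reproduce. The sketch below assumes the usual textbook form of gain ratio (information gain divided by the entropy of the feature's own value distribution, often called split information); it leaves out the additional heuristics C4.5 applies in practice. The feature names row_id and signal, and the data, are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the size-weighted entropy of each partition."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


def gain_ratio(feature_values, labels):
    """Information gain normalized by split information (entropy of the feature)."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0  # a constant feature creates no partition at all
    return information_gain(feature_values, labels) / split_info


labels = ["yes"] * 8 + ["no"] * 8

# A unique identifier per row: every partition is trivially pure, yet useless on new data.
row_id = list(range(16))

# A binary feature that agrees with the class on 14 of 16 rows.
signal = ["a"] * 7 + ["b"] + ["b"] * 7 + ["a"]

features = {"row_id": row_id, "signal": signal}
chosen_by_gain = max(features, key=lambda k: information_gain(features[k], labels))
chosen_by_ratio = max(features, key=lambda k: gain_ratio(features[k], labels))
print(chosen_by_gain, chosen_by_ratio)
```

Raw information gain prefers row_id, which isolates every row and so appears to remove all uncertainty. Gain ratio divides that gain by the 4 bits needed to encode sixteen distinct values, and the binary feature wins instead.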