Gini Impurity

Gini impurity is a measure of how mixed the class labels in a set are, and it is used to choose splits when constructing decision trees. For a set of items belonging to C classes, the Gini impurity is defined as G = 1 − Σᵢ pᵢ², where pᵢ is the fraction of items labeled with class i and the sum runs over all C classes. A set containing only one class has Gini impurity 0 (pure); a set with all C classes in equal proportion has the maximum impurity, 1 − 1/C (0.5 for two classes). Equivalently, G is the probability that a randomly chosen element would be incorrectly labeled if its label were drawn at random from the distribution of labels in the set.
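
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name gini_impurity is hypothetical, and returning 0 for an empty set is an assumption made here for convenience.

```python
from collections import Counter

def gini_impurity(labels):
    """Return G = 1 - sum(p_i^2), where p_i is the fraction of items in class i."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty set as pure
    counts = Counter(labels)  # class label -> number of items with that label
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini_impurity(["a", "a", "a"])   # one class only
mixed = gini_impurity(["a", "b"])       # two classes in equal proportion
```

Here pure is 0.0 and mixed is 0.5, matching the two-class maximum of 1 − 1/C.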

Gini impurity is closely related to entropy, the impurity measure behind information gain, and the two often produce similar splits. In practice the choice between them is usually made on computational grounds: Gini is slightly faster to compute because it avoids the logarithm. The deeper difference is conceptual. Entropy comes from information theory, where it measures the average number of bits needed to encode a label. Gini impurity comes from statistics, where it is a misclassification probability; it is named after the statistician Corrado Gini and is distinct from the Gini coefficient used in economics to measure inequality. The fact that both work suggests that the choice of impurity measure is less important than the tree-growing algorithm's capacity to find good splits. The impurity measure is a lens, not a law.
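
To make the comparison concrete, the sketch below scores two candidate splits of the same eight labels under both measures, using the size-weighted average impurity of the two children. The helper names (class_fractions, split_impurity) and the toy splits are hypothetical and chosen only for illustration; real tree learners differ in how they search for and score splits.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fractions p_i of each class present in `labels`."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    # Only classes that actually occur are counted, so p > 0 and log2 is defined.
    return -sum(p * math.log2(p) for p in class_fractions(labels))

def split_impurity(left, right, measure):
    """Average impurity of the two children, weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)

# Two hypothetical candidate splits of the same eight labels.
split_a = (["x", "x", "x", "y"], ["y", "y", "y", "x"])  # mostly separated
split_b = (["x", "x", "y", "y"], ["x", "x", "y", "y"])  # no separation

for measure in (gini, entropy):
    a = split_impurity(*split_a, measure)
    b = split_impurity(*split_b, measure)
    better = "A" if a < b else "B"
    print(f"{measure.__name__}: A={a:.3f} B={b:.3f} prefers split {better}")
```

On this toy data both measures prefer split A (0.375 against 0.5 under Gini, roughly 0.811 against 1.0 under entropy), which is consistent with the observation that they often agree. One small example does not show that they always do.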