Random forest

Random forest is an ensemble learning method that trains a large collection of decorrelated decision trees on bootstrap samples of the data and aggregates their predictions, turning a high-variance, unstable base learner into a robust predictor that generalizes well. Introduced in its modern form by Leo Breiman in 2001, it extends bagging (bootstrap aggregating) by injecting a second source of randomness: at each split in each tree, only a random subset of features is considered rather than the full set. This seemingly minor modification substantially reduces the correlation between trees, and it has made the method one of the most reliable and widely deployed algorithms in applied machine learning.

Algorithm and Mechanism

The construction of a random forest relies on two independent sources of randomness. First, each tree is trained on a bootstrap sample: a set of examples drawn from the training data uniformly at random with replacement, typically the same size as the original dataset. On average such a sample contains about 63% of the distinct training examples, and the remainder are left out of that tree's training set. This is the bagging step, and it ensures that each tree sees a slightly different set of examples, which reduces (but does not eliminate) the correlation between the trees' errors. Second, at each node split, the algorithm considers only a random subset of the features; common defaults are the square root of the number of features for classification and one-third of the features for regression. This feature randomization is the critical innovation: it prevents a few dominant features from being selected near the top of every tree, which would leave the trees highly correlated and weaken the variance reduction obtained by averaging.
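
The two randomization steps can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code of any particular implementation; the function names (bootstrap_indices, n_candidate_features, candidate_features) are hypothetical, and the subset sizes follow the common defaults described above.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def n_candidate_features(n_features, task):
    """Common default subset sizes: floor(sqrt(p)) for classification,
    floor(p / 3) for regression, and never fewer than one feature."""
    if task == "classification":
        return max(1, math.isqrt(n_features))
    if task == "regression":
        return max(1, n_features // 3)
    raise ValueError(f"unknown task: {task!r}")


def candidate_features(n_features, task, rng):
    """Pick the feature indices that a single node split may consider."""
    k = n_candidate_features(n_features, task)
    return sorted(rng.sample(range(n_features), k))


if __name__ == "__main__":
    rng = random.Random(0)
    idx = bootstrap_indices(10_000, rng)
    # The share of distinct rows in the bag should be close to 1 - 1/e.
    print("distinct fraction in bag:", len(set(idx)) / len(idx))
    print("features tried at one split:", candidate_features(16, "classification", rng))
```

In a full implementation, bootstrap_indices would be called once per tree and candidate_features once per node, so two nodes in the same tree generally consider different feature subsets.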

The prediction rule is simple. For classification, each tree votes for a class label, and the forest returns the majority vote. For regression, each tree predicts a numerical value, and the forest returns the average. No single tree is authoritative; accuracy comes from aggregating many imperfect predictors. This is not merely a heuristic. For trees with equal variance and a common pairwise correlation, the variance of the averaged prediction cannot fall below the product of that correlation and the single-tree variance, however many trees are added, so lowering the correlation lowers the floor. Feature randomization is the mechanism that promotes this decorrelation, usually at the cost of somewhat weaker individual trees.
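
The aggregation rules and the variance argument can be made concrete with a short sketch. It assumes an idealized setting that the text does not spell out: B identically distributed tree predictions, each with variance sigma^2 and a common pairwise correlation rho, for which the variance of the average is rho * sigma^2 + (1 - rho) * sigma^2 / B. The function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Classification: return the most common label among the trees' votes.
    Ties fall to whichever label Counter lists first."""
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Regression: return the mean of the trees' numeric predictions."""
    return sum(values) / len(values)


def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictions, each with variance sigma2
    and common pairwise correlation rho (idealized assumption)."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


if __name__ == "__main__":
    print(majority_vote(["cat", "dog", "cat"]))
    print(average_prediction([1.0, 2.0, 3.0]))
    # With rho = 0.5 the variance stays above 0.5 however many trees are used;
    # with rho = 0.1 the floor drops to 0.1.
    for rho in (0.5, 0.1):
        print(rho, [ensemble_variance(1.0, rho, b) for b in (1, 10, 100, 1000)])
```

The second term shrinks as trees are added, but the first does not, which is why reducing the correlation between trees matters more than simply growing a larger forest.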

Properties and Theoretical Guarantees

Random forests combine several properties that make them theoretically interesting and practically useful. They need relatively little hyperparameter tuning: the number of trees, the number of features considered per split, and the tree depth (or minimum leaf size) are the main parameters, and common defaults often perform well. Tree-based splitting is insensitive to feature scaling and can in principle handle continuous, categorical, and ordinal inputs, although some implementations require categorical variables to be encoded numerically first. They provide estimates of variable importance, a measure of how much each feature contributes to prediction accuracy; these estimates should be read with care, because impurity-based importance tends to favor features with many distinct values, and correlated features can share or mask one another's importance. And they offer an internal validation mechanism through the out-of-bag error: each tree is tested on the examples left out of its bootstrap sample, and the aggregated out-of-bag predictions give an approximately unbiased, often slightly pessimistic, estimate of generalization error without requiring a separate validation set.
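
The out-of-bag mechanism can be illustrated with a small sketch. To keep it short, the base learner here is a deliberately simple stand-in, a single-threshold stump on one-dimensional data, rather than a full decision tree with feature subsampling; the names fit_stump, predict_stump, and oob_error are hypothetical. The part that matters is the bookkeeping: each example is scored only by the learners whose bootstrap sample did not contain it.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split stand-in for a tree: choose the threshold on x
    that gives the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a threshold at the maximum leaves nothing on the right
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    if best is None:  # all x values identical: predict the majority label
        label = Counter(ys).most_common(1)[0][0]
        return (xs[0], label, label)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def oob_error(xs, ys, n_trees=50, seed=0):
    """Out-of-bag misclassification rate of a bagged ensemble of stumps."""
    rng = random.Random(seed)
    n = len(xs)
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # only learners that never saw example i vote
                oob_votes[i][predict_stump(stump, xs[i])] += 1
    # Score only the examples that were out of bag at least once.
    scored = [(v.most_common(1)[0][0], ys[i]) for i, v in enumerate(oob_votes) if v]
    return sum(pred != truth for pred, truth in scored) / len(scored)


if __name__ == "__main__":
    xs = [i / 100 for i in range(100)]
    ys = [int(x > 0.5) for x in xs]
    print("out-of-bag error:", oob_error(xs, ys))
```

Because each example is voted on by only the roughly one-third of learners that did not train on it, the out-of-bag estimate effectively describes a smaller ensemble than the one used at prediction time, which is one reason it tends to err on the pessimistic side.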

The theoretical analysis of random forests connects to results in statistical learning theory. Breiman showed that the generalization error of a forest converges to a limit as the number of trees increases, and that this limit is bounded above in terms of two quantities: the strength of the individual trees and the correlation between them. Later work has connected random forests to kernel methods and to the geometry of partition-based estimation, showing that a forest implicitly constructs an adaptive kernel, or weighted neighborhood, that varies across the input space. This connection is not merely aesthetic: it suggests that the success of random forests rests on more than empirical folklore and is grounded in the same mathematical structures that underlie more theoretically explicit methods.
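
For reference, the relationship between generalization error, strength, and correlation is usually stated as the following upper bound from Breiman's 2001 analysis of classification forests. The notation is an assumption introduced here rather than taken from the text above: PE* denotes the generalization error of the forest, s > 0 the strength (the expected margin of the ensemble), and \bar{\rho} the mean correlation between the trees' raw margin functions.

```latex
\mathrm{PE}^{*} \;\le\; \frac{\bar{\rho}\,\bigl(1 - s^{2}\bigr)}{s^{2}}
```

The bound is loose in practice, but it captures the trade-off described above: it tightens when individual trees are stronger (larger s) and when they are less correlated (smaller \bar{\rho}).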

Connections, Limitations, and the Deep Learning Shadow

Random forests occupy a curious position in the contemporary machine learning landscape. They were among the dominant methods for structured data problems throughout the 2000s and early 2010s, and they remain a standard baseline against which new methods for tabular data are measured. Yet the rise of deep learning has cast them into a relative shadow, and the research community has shifted its attention toward neural architectures. It would be a mistake to conclude from this that random forests are obsolete. On structured, tabular data of small to moderate size, tree ensembles such as random forests and gradient boosting machines frequently match or outperform deep learning, typically at far lower computational cost, with less tuning, and with models that are often easier to inspect.

The deeper limitation of random forests is not predictive but epistemological. Like all ensemble methods, they excel at prediction but resist explanation. The forest knows which features are important, but it does not know why. It captures correlations without articulating mechanisms. This is the same limitation that afflicts deep learning, and it is the central challenge of contemporary machine learning: how to build systems that are both accurate and accountable, both powerful and comprehensible. Random forests do not solve this problem, but they make it visible in a form that is more analytically tractable than the black box of neural networks. The transparency of a single tree is lost in the forest, but the statistical regularity of the ensemble is a different kind of transparency — one that reveals the structure of the data rather than the structure of the model.

The obsession with finding a single 'best' algorithm is a symptom of the same reductionism that ensemble learning was designed to cure. Random forests are not a stepping stone to deep learning; they are a co-evolved solution to a different problem, and the field's health depends on maintaining both lineages. The question is not which method wins but which problems each method is suited to solve, and how their combination — stacking forests with neural networks, using forests to preprocess features for deep learning — can produce systems that neither could achieve alone. The future of machine learning is not monoculture but polyculture, and random forests are one of the most reliable crops in the field.