Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a foundational principle in statistical inference: given a parametric family of probability distributions and a set of observed data, find the parameter values that make the observed data most probable. The method was systematically developed by Ronald Fisher in the 1920s and remains the default estimation strategy across the sciences, from genetics to econometrics to machine learning.
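As a minimal sketch of the principle, assume independent Bernoulli (coin-flip) observations. The Python below compares a brute-force grid search over the log-likelihood with the closed-form maximizer k/n. The data and the helper names `log_likelihood` and `grid_mle` are illustrative assumptions, not part of any library.

```python
import math

def log_likelihood(p, data):
    """Bernoulli log-likelihood of 0/1 observations for success probability p."""
    k = sum(data)
    n = len(data)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def grid_mle(data, steps=999):
    """Maximize the log-likelihood over an evenly spaced grid strictly inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda p: log_likelihood(p, data))

# Illustrative data: 7 successes in 10 trials.
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
p_closed_form = sum(data) / len(data)   # analytic maximizer k / n
p_grid = grid_mle(data)                 # numerical maximizer on the grid
print(p_closed_form, p_grid)
```

The grid search is only for illustration; in practice the maximizer is found analytically when possible and by numerical optimization otherwise.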
The maximum likelihood estimator has desirable asymptotic properties. Under regularity conditions it is consistent (it converges to the true parameter as the sample size grows), asymptotically normal (its sampling distribution approaches a Gaussian), and asymptotically efficient (in large samples its variance attains the Cramér-Rao lower bound, which is the inverse of the Fisher information). These properties make MLE a standard benchmark against which other estimators are compared.
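These properties can be checked by simulation. The sketch below assumes Bernoulli data with a true success probability of 0.3 (an arbitrary choice) and uses a hypothetical helper `bernoulli_mle`. It prints estimates for growing sample sizes, then compares the variance of the estimator across replicates with the Cramér-Rao bound p(1 - p)/n.

```python
import random
import statistics

def bernoulli_mle(n, p_true, rng):
    """MLE of the success probability from n simulated Bernoulli draws."""
    return sum(rng.random() < p_true for _ in range(n)) / n

rng = random.Random(0)   # fixed seed, for reproducibility only
p_true = 0.3

# Consistency: the estimate should settle near p_true as n grows.
for n in (10, 100, 1000, 10000):
    print(n, bernoulli_mle(n, p_true, rng))

# Efficiency: across many replicates, the variance of the MLE should be
# close to the Cramér-Rao bound 1 / (n * I(p)) = p(1 - p) / n.
n = 500
estimates = [bernoulli_mle(n, p_true, rng) for _ in range(2000)]
empirical_var = statistics.variance(estimates)
cramer_rao = p_true * (1.0 - p_true) / n
print(empirical_var, cramer_rao)
```

A histogram of `estimates` would also look approximately Gaussian around 0.3, which is the asymptotic normality property.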
In machine learning, MLE underlies the training of probabilistic models: minimizing the cross-entropy loss in classification is equivalent to maximizing the likelihood of a categorical distribution, and minimizing mean squared error in regression is equivalent to maximizing the likelihood under Gaussian noise with fixed variance. The expectation-maximization (EM) algorithm extends MLE to models with latent variables, alternating between computing the expected complete-data log-likelihood under the current parameters (which for exponential-family models reduces to expected sufficient statistics) and updating the parameters to maximize it.
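Both equivalences can be verified numerically. The sketch below assumes a one-parameter linear model without intercept and made-up data; all function names are illustrative. It shows that the slope minimizing mean squared error also minimizes the Gaussian negative log-likelihood, and that cross-entropy equals the negative log-likelihood of the labels divided by the sample size.

```python
import math

def mse(w, xs, ys):
    """Mean squared error of the no-intercept linear model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gaussian_nll(w, xs, ys, sigma=1.0):
    """Negative log-likelihood of y ~ Normal(w * x, sigma^2), sigma held fixed."""
    n = len(xs)
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return 0.5 * n * math.log(2.0 * math.pi * sigma ** 2) + sse / (2.0 * sigma ** 2)

# Illustrative data, roughly y = 2x plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 7.8]

candidates = [i / 100 for i in range(100, 301)]   # slopes 1.00 .. 3.00
w_mse = min(candidates, key=lambda w: mse(w, xs, ys))
w_nll = min(candidates, key=lambda w: gaussian_nll(w, xs, ys))
print(w_mse, w_nll)   # the two criteria select the same slope

def cross_entropy(labels, probs):
    """Average cross-entropy between integer labels and predicted class probabilities."""
    return -sum(math.log(p[y]) for y, p in zip(labels, probs)) / len(labels)

def categorical_likelihood(labels, probs):
    """Joint probability of the observed labels under the predicted distributions."""
    return math.prod(p[y] for y, p in zip(labels, probs))

labels = [0, 2, 1]
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
ce = cross_entropy(labels, probs)
nll_per_sample = -math.log(categorical_likelihood(labels, probs)) / len(labels)
print(ce, nll_per_sample)   # identical up to floating-point error
```

The Gaussian negative log-likelihood is a constant plus the sum of squared errors scaled by 1/(2σ²), so with σ fixed the two objectives share the same minimizer.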
The limitations of MLE are equally important. It is sensitive to model misspecification: if the true data-generating distribution is not in the parametric family, the MLE converges, under regularity conditions, to the parameter that minimizes the KL divergence from the truth to the model. That is the best fit in the log-likelihood sense, but it is not necessarily the parameter that performs best under the loss that matters for a given task. In high-dimensional settings, where the number of parameters is large relative to the sample size, MLE often overfits, which is why regularization (ridge, lasso, Bayesian priors) has become essential.
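The behavior under misspecification can be illustrated with a deliberately wrong model. The sketch below assumes data drawn from a Gamma distribution with shape 3 and scale 1, and fits an exponential model to it; both choices are arbitrary and made only for illustration. For the exponential family the KL-minimizing rate has a closed form, 1/E[X], and the MLE approaches it.

```python
import random

rng = random.Random(1)   # fixed seed, for reproducibility only

# True data-generating process: Gamma(shape=3, scale=1), which has mean 3.
# No exponential distribution matches it, so the exponential model is misspecified.
data = [rng.gammavariate(3.0, 1.0) for _ in range(50000)]

# MLE of the rate under the (wrong) exponential model: 1 / sample mean.
rate_mle = len(data) / sum(data)

# KL minimizer within the exponential model: minimizing
# E[-log f(X; rate)] = -log(rate) + rate * E[X] over rate gives rate = 1 / E[X].
rate_kl = 1.0 / 3.0

print(rate_mle, rate_kl)
```

The fitted exponential matches the mean of the data but not its shape, so any quantity that depends on the shape (a tail probability, for example) can still be badly estimated.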
Maximum likelihood estimation is often taught as the 'correct' way to fit models, with regularization presented as an afterthought to handle small samples. This framing is backwards. The MLE coincides with the maximum a posteriori (MAP) estimate under a flat prior, which is improper when the parameter space is unbounded, and the flat prior is almost never the right prior. Regularization is not a correction to MLE; it is the recognition that MLE without regularization is a special case of a broader framework that acknowledges model uncertainty: ridge corresponds to a Gaussian prior on the coefficients and lasso to a Laplace prior. The frequentist textbook tradition that treats MLE as fundamental and Bayesian methods as advanced has done more damage to statistical practice than any technical theorem.
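The link between MLE, priors, and regularization can be made concrete with a Bernoulli parameter and a Beta prior. In this sketch the helper `bernoulli_map` and the prior parameters are illustrative assumptions. Because the parameter lives on a bounded interval, the flat prior Beta(1, 1) is proper here, but the point carries over: the flat prior reproduces the MLE, and any informative prior acts as a regularizer.

```python
def bernoulli_map(k, n, a=1.0, b=1.0):
    """Posterior mode of a Bernoulli parameter under a Beta(a, b) prior.

    Assumes a >= 1, b >= 1 and n >= 1, so the posterior Beta(k + a, n - k + b)
    is maximized at this value (possibly on the boundary of [0, 1]).
    """
    return (k + a - 1.0) / (n + a + b - 2.0)

k, n = 3, 3                                  # three successes in three trials
mle = k / n                                  # the MLE claims failure is impossible
map_flat = bernoulli_map(k, n, 1.0, 1.0)     # flat prior reproduces the MLE
map_reg = bernoulli_map(k, n, 2.0, 2.0)      # Beta(2, 2) prior pulls toward 0.5
print(mle, map_flat, map_reg)
```

With only three observations the MLE sits on the boundary of the parameter space, while the Beta(2, 2) prior behaves like one extra success and one extra failure and gives a less extreme estimate.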