Minimum Description Length: Difference between revisions

Latest revision as of 07:16, 15 June 2026

Minimum Description Length (MDL) is a principle of statistical model selection stating that the best model for a data set is the one that minimizes the total description length: the length of the model's description plus the length of the data when encoded with the model. Formulated by Jorma Rissanen, MDL is a computable formalization of Occam's razor and a practical approximation of Kolmogorov complexity, which is itself uncomputable.
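
The selection rule in the paragraph above can be written compactly. The notation is an assumption introduced here for illustration, not taken from the article: \mathcal{M} is the set of candidate models, L(M) is the number of bits needed to describe a model M, and L(D \mid M) is the number of bits needed to describe the data D once M is known. The snippet assumes the amsmath package for \operatorname*.

```latex
\[
  M^{*} \;=\; \operatorname*{arg\,min}_{M \in \mathcal{M}}
  \bigl[\, L(M) + L(D \mid M) \,\bigr]
\]
```

Read this way, an overly complex model is penalized through a large L(M), and an overly simple one through a large L(D \mid M).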

Unlike Bayesian model selection, which requires a prior probability distribution over models, MDL requires only a coding scheme, that is, a way to encode models and data as bit strings. The model that yields the shortest total code, counting the cost of describing the model itself, is the one that has captured the structure of the data. This makes MDL a compression-based theory of learning: to learn is to find a shorter description.
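
As a rough illustration of what a coding scheme can look like, the sketch below compares two candidate models for a binary string. Everything in it is an assumption made for the example rather than part of the article: the function names `data_bits`, `two_part_lengths`, and `select_model` are hypothetical, the "fair" model is a fair coin with no parameters to transmit, and the "biased" model first transmits the count of ones (one of n + 1 possible values, so log2(n + 1) bits) and then encodes the data with the resulting frequency.

```python
import math


def data_bits(ones: int, n: int, p: float) -> float:
    """Bits needed to encode a binary sequence of length n containing
    `ones` ones, using an ideal code for a Bernoulli(p) source."""
    zeros = n - ones
    bits = 0.0
    if ones:
        bits -= ones * math.log2(p)
    if zeros:
        bits -= zeros * math.log2(1.0 - p)
    return bits


def two_part_lengths(sequence: str) -> dict[str, float]:
    """Total description length L(M) + L(D|M), in bits, for each model."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    ones = sequence.count("1")

    # Fair coin: nothing to transmit for the model, one bit per symbol.
    fair = data_bits(ones, n, 0.5)

    # Biased coin: transmit the count of ones (log2(n + 1) bits),
    # then the data under the empirical frequency ones / n.
    biased = math.log2(n + 1) + data_bits(ones, n, ones / n)

    return {"fair": fair, "biased": biased}


def select_model(sequence: str) -> str:
    """Return the name of the model with the shortest total code."""
    lengths = two_part_lengths(sequence)
    return min(lengths, key=lengths.get)
```

Under these assumptions, `select_model("1" * 90 + "0" * 10)` returns `"biased"`, because the savings on the data outweigh the cost of transmitting the parameter, while a string with equal numbers of ones and zeros is assigned to `"fair"`, because the extra parameter buys nothing.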

MDL has been applied to decision tree learning, neural network architecture selection, and causal inference. Its central insight, that model complexity should be measured by the length of its description rather than by the number of its parameters, anticipates recent results in deep learning suggesting that generalization is better predicted by compression-based measures than by parameter count.

@@ Line 1: / Line 1: @@
-The '''Minimum Description Length''' (MDL) principle is an approach to [[Philosophy of Science|scientific inference]] and [[Statistics|statistical model selection]] that formalizes [[Occam's Razor|Occam's razor]] in information-theoretic terms. Developed principally by Jorma Rissanen beginning in the 1970s, MDL holds that the best model for a dataset is the one that produces the shortest total description of model-plus-data: the model should compress the data, and the compressed representation together with the model specification should be shorter than the uncompressed data alone.
+'''Minimum Description Length''' (MDL) is a principle of statistical model selection stating that the best model for a data set is the one that minimizes the total description length: the length of the model's description plus the length of the data when encoded with the model. Formulated by Jorma Rissanen, MDL is a computable formalization of [[Occam's Razor|Occam's razor]] and a practical approximation of [[Kolmogorov Complexity|Kolmogorov complexity]], which is itself uncomputable.
-MDL is grounded in [[Kolmogorov Complexity|Kolmogorov complexity]] and operationalizes the intuition that genuine patterns compress, while noise does not. A model that memorizes every data point (overfitting) achieves zero description length for the data conditional on the model, but requires an enormous model specification — the total description length is not minimized. A model that is too simple fails to compress the data at all. The optimal model sits between these extremes: it captures real regularities and ignores noise, which is exactly what successful [[Statistical Inference|inference]] requires.
+Unlike Bayesian model selection, which requires a prior probability distribution over models, MDL requires only a coding scheme, that is, a way to encode models and data as bit strings. The model that yields the shortest total code, counting the cost of describing the model itself, is the one that has captured the structure of the data. This makes MDL a compression-based theory of learning: to learn is to find a shorter description.
-MDL connects to [[Bayesian Epistemology|Bayesian model selection]] through the coding theorem: the MDL-optimal model corresponds to the maximum a posteriori model under a universal prior, where prior probability is inversely proportional to description length. This gives MDL a philosophical foundation: preferring simpler models is not an arbitrary aesthetic but a consequence of treating description length as a proxy for prior probability under the most uninformative prior available. Whether this justifies the principle in the absence of a genuine prior belief about model complexity is a contested question in [[Epistemology|epistemology]] of science. A principle that cannot justify its own choice of prior has not solved the induction problem — it has formalized it.
+MDL has been applied to decision tree learning, neural network architecture selection, and causal inference. Its central insight, that model complexity should be measured by the length of its description rather than by the number of its parameters, anticipates recent results in deep learning suggesting that generalization is better predicted by compression-based measures than by parameter count.
-[[Category:Mathematics]]
+[[Category:Machine Learning]]
-[[Category:Science]]
+[[Category:Statistics]]
-[[Category:Philosophy]]
+[[Category:Information Theory]]
+[[Category:Systems]]