Algorithmic Probability

Algorithmic probability is the framework for inductive inference that assigns prior probabilities to hypotheses according to their algorithmic complexity: simpler theories — those with shorter descriptions — receive higher prior probability. Developed by Ray Solomonoff in 1964, it provides the theoretically optimal method for prediction given data, and establishes a precise, mathematical foundation for Occam's razor that is neither heuristic nor aesthetic.

Unlike conventional Bayesian inference, which requires the analyst to specify a prior distribution over hypotheses by hand, algorithmic probability derives the prior from the structure of the hypotheses themselves. A hypothesis is a program for a universal Turing machine that generates observed data as output. The prior probability of a hypothesis is proportional to 2^{-L}, where L is the length of the program in bits. This is the universal prior — a distribution over all computable hypotheses that favors compressible descriptions.

Solomonoff Induction and Universal Prediction

Solomonoff induction combines the universal prior with Bayesian updating to produce a prediction method with remarkable formal properties. Given a sequence of observations, Solomonoff induction predicts the next symbol by averaging the predictions of all computable hypotheses, weighted by their posterior probabilities. The result is a predictor that is, in a precise sense, as good as any computable predictor — it converges to the true data-generating process faster than any other method, up to a constant factor.

This optimality result is not practical consolation. It is a boundary theorem: it says that any improvement over Solomonoff induction would require a non-computable method. The universal prior is itself uncomputable — no algorithm can enumerate all programs and compute their exact lengths. But the theorem establishes the theoretical limit that all practical methods approximate. Machine learning algorithms, minimum description length (MDL) methods, and compression-based classifiers are all, knowingly or not, computable approximations to Solomonoff induction.

The Mathematics of Simplicity

The connection to Kolmogorov complexity is direct and exact. The universal prior assigns probability mass according to description length, and description length is Kolmogorov complexity. Where Kolmogorov complexity measures the information content of a single object, algorithmic probability measures the plausibility of the generative process. A string with low Kolmogorov complexity has many short programs that could have generated it; under the universal prior, these programs collectively receive high probability mass.

This formalizes a centuries-old intuition: simple explanations are more likely to be true. But it formalizes it without metaphysics. The preference for simplicity is not a law of nature or a methodological rule of thumb. It is a mathematical consequence of the definition of probability over computable processes. If the world is generated by a computable process, and if we know nothing about which process except that it is computable, then the correct prior is the universal prior — and the correct prior favors short descriptions. Occam's razor is not an assumption. It is a theorem.

Uncomputability and the Approximation Game

The universal prior is uncomputable, and this uncomputability is not a technical obstacle to be overcome by better hardware. It is a fundamental limit, cousin to the halting problem and Gödel's incompleteness. To compute the exact Solomonoff prediction, one would need to evaluate all programs and determine which halt and produce the observed data. No algorithm can do this.

What practitioners do instead is approximate. The minimum description length principle, developed by Jorma Rissanen, operationalizes the universal prior using practical compression algorithms. Large language models, trained to predict next tokens by minimizing cross-entropy loss, learn implicit priors over text that approximate the universal prior over their training distribution. The no free lunch theorem in machine learning — which states that no learning algorithm dominates all others across all possible problems — is the contrapositive of Solomonoff optimality: it says that without a prior (or with a uniform prior over all problems), prediction is impossible. Algorithmic probability provides the principled prior that makes learning possible.

Connections

Kolmogorov Complexity — the length measure underlying the universal prior
Bayesian inference — the updating mechanism
Information Theory — the communication-theoretic foundation
Inductive Reasoning — the philosophical problem Solomonoff solved in principle
Occam's Razor — the heuristic that algorithmic probability proves
Machine Learning — the field that approximates Solomonoff induction in practice
Artificial Intelligence — the broader project of building systems that learn from experience

References and Further Reading

Solomonoff, R. J. (1964). A formal theory of inductive inference. Information and Control, 7(1-2), 1–22, 224–254.
Li, M., & Vitányi, P. (2008). An Introduction to Kolmogorov Complexity and Its Applications (3rd ed.). Springer.
Hutter, M. (2007). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465–471.

The persistent conflation of algorithmic probability with mere data compression reveals a conceptual failure that runs through machine learning and philosophy of science alike. Solomonoff induction is not about finding short descriptions. It is about finding the correct prior over all possible descriptions of reality. The fact that this prior is uncomputable is not a reason to abandon it — it is a reason to understand what every practical learning algorithm is approximating, and what the approximation costs. A field that builds learning systems without knowing what theoretical limit it is approximating is not doing engineering. It is doing alchemy with better metrics.