Hierarchical Models

Hierarchical models (also called multilevel models or mixed-effects models) are statistical frameworks in which parameters are themselves treated as random variables drawn from a higher-level distribution, rather than as fixed unknown quantities to be estimated in isolation. The central insight is that observations within a group share information about the group-level distribution, and that this information can be pooled across groups to improve estimates — a process called partial pooling or shrinkage.

A classic example: estimating the effectiveness of a medical treatment across many hospitals. A non-hierarchical approach either treats each hospital separately (no pooling — ignores shared information) or combines all hospitals into one estimate (complete pooling — ignores hospital-level variation). Hierarchical models do neither: they let hospitals share information via a common prior on hospital-level parameters, estimated from the data itself.

This makes hierarchical models a natural implementation of empirical Bayesian inference: the higher-level distribution acts as a data-derived prior on lower-level parameters. The prior is not assumed from first principles but estimated from the observed variation across groups, then used to regularize individual estimates. Hospitals with limited data are pulled toward the grand mean; hospitals with extensive data are allowed to differ.

Hierarchical models are now standard in cognitive science, educational research, ecology, and clinical trial design. Their spread has been limited primarily by computational complexity and the misinterpretation of random effects as nuisance terms to be controlled for rather than as informative structure about variation in the population.

The Computational Cost of Hierarchy

The article notes that hierarchical models have been limited by 'computational complexity,' but this underspecifies the problem. Exact Bayesian inference for hierarchical models is not merely difficult — it is intractable in the technical sense. Computing the full posterior over the joint hierarchy requires marginalizing over all combinations of group-level and unit-level parameters, a computation that grows exponentially with the number of random effects. This places exact hierarchical inference in the #P-hard complexity class — harder, in the worst case, than any problem in NP.

What practitioners actually use are approximations: Markov Chain Monte Carlo (MCMC), variational inference, Laplace approximations. These are computationally tractable but systematically biased in ways that depend on the approximation method chosen. The choice of approximation is not epistemically neutral — different methods fail differently, and there is no general result establishing which failures matter for which inferential goals.

This connects hierarchical models to the broader question raised in P versus NP: the gap between what is mathematically well-defined and what is computationally achievable. A model that is correct in principle but approximated in practice is not the same model. The epistemological status of conclusions drawn from approximate hierarchical inference — a status routinely elided in applied work — deserves more scrutiny than it receives.

The spread of hierarchical modeling through disciplines that lack strong quantitative traditions (educational research, psychology, ecology) has produced a secondary problem: Computational Substrate Bias in reverse. Where the original bias is the modeling tool shapes the theory, the reverse is the theory's mathematical prestige obscures its computational limits. Practitioners adopt hierarchical models because they are Bayesian and therefore principled, without attending to what their specific approximation algorithm implies about the reliability of their specific estimates.