Multiple Comparisons Problem

The multiple comparisons problem is the inflation of the false positive rate that occurs when many statistical tests are performed together without adjusting the significance threshold. If a single test is conducted at the conventional α = 0.05 level and the null hypothesis is true, the probability of a false positive is 5%. But if twenty independent tests are conducted and every null hypothesis is true, the probability of at least one false positive is 1 − 0.95^20, or about 64%. In an era of high-dimensional data (genomics, neuroimaging, econometrics), researchers routinely conduct thousands or millions of tests, so uncorrected multiple comparisons make spurious findings all but certain.
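The arithmetic behind the 64% figure can be checked directly. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name `familywise_error_rate` is illustrative, not taken from any library.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m tests.

    Assumes the m tests are independent and every null hypothesis is true.
    """
    # P(no false positive in one test) = 1 - alpha, so across m
    # independent tests P(no false positive at all) = (1 - alpha) ** m.
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for m in (1, 20, 100, 1000):
        print(f"m = {m:>4}: P(at least one false positive) = "
              f"{familywise_error_rate(0.05, m):.4f}")
```

For m = 20 this gives roughly 0.64, and the probability approaches 1 quickly as m grows.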

The problem is not merely technical. It is architectural: modern scientific instruments produce data whose dimensionality far exceeds the scope of the theoretical frameworks that motivated the data collection, while the standard statistical toolkit was designed for testing a single hypothesis. The mismatch between data volume and inferential framework produces a systematic bias toward discovery. Every voxel in an fMRI brain scan, every gene expression level, every variable in a large-N survey is a potential 'finding' if it is tested on its own. P-hacking exploits this architecture by searching across the high-dimensional space until significance is found, then reporting the significant result as if it were the only test performed.
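A small simulation illustrates how this search-then-report pattern manufactures findings. It is a sketch under stated assumptions: no variable has any real effect, the tests are independent, and each p-value is therefore drawn uniformly from [0, 1]. The function name, the number of simulated studies, and the seed are all hypothetical choices made for illustration.

```python
import random


def fraction_with_reportable_finding(n_variables: int,
                                     n_studies: int = 10_000,
                                     alpha: float = 0.05,
                                     seed: int = 0) -> float:
    """Fraction of all-null studies that can still report a 'significant' result.

    Each simulated study tests n_variables variables, none of which has a
    real effect, and then reports only its smallest p-value.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under a true null with a continuous test statistic,
        # a p-value is uniformly distributed on [0, 1].
        smallest_p = min(rng.random() for _ in range(n_variables))
        if smallest_p < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for n in (1, 20, 100):
        frac = fraction_with_reportable_finding(n)
        print(f"{n:>3} variables searched: {frac:.1%} of null studies "
              f"report p < 0.05")
```

A reader who sees only the reported p-value has no way to tell a study that tested one variable from one that searched a hundred.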

Classical corrections (Bonferroni, Šidák, false discovery rate control) attempt to control either the familywise error rate or the expected proportion of false discoveries. These corrections come at a cost: they reduce power to detect genuine effects, and they handle dependence between tests imperfectly. Bonferroni remains valid under arbitrary dependence but becomes increasingly conservative as tests become more strongly correlated; Šidák and the standard Benjamini-Hochberg procedure rely on independence or on particular forms of positive dependence, conditions that are hard to verify in structured biological or social data. The deeper problem is that the 'number of tests performed' is often itself undefined in exploratory analyses, where the hypothesis space is generated dynamically from the data. When the analysis plan is not pre-specified, the concept of a 'family' of comparisons loses its footing.
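The three corrections can be sketched in a few lines each. This is a minimal illustration, not a reference implementation: false discovery rate control is represented here by the Benjamini-Hochberg step-up procedure, which is an assumption about the method meant, and the p-values in the usage example are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the familywise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def sidak(p_values, alpha=0.05):
    """Reject H_i when p_i <= 1 - (1 - alpha)^(1/m); exact for independent tests."""
    m = len(p_values)
    threshold = 1.0 - (1.0 - alpha) ** (1.0 / m)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate.

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects
    the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, chosen only to show the procedures disagreeing.
    ps = [0.001, 0.008, 0.012, 0.03, 0.04, 0.20, 0.50, 0.90]
    print("uncorrected:       ", sum(p <= 0.05 for p in ps))
    print("Bonferroni:        ", sum(bonferroni(ps)))
    print("Sidak:             ", sum(sidak(ps)))
    print("Benjamini-Hochberg:", sum(benjamini_hochberg(ps)))
```

On these eight made-up p-values, five fall below 0.05 without correction, Bonferroni and Šidák each retain one, and Benjamini-Hochberg retains three. All three functions take the number of tests m as given, which is exactly the quantity that becomes ill-defined when the analysis is exploratory.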

The open science emphasis on pre-registration is a structural response: if the analysis plan is locked before data collection, the number of comparisons is fixed and can be corrected. But pre-registration assumes the hypothesis space is enumerable in advance. In genuinely exploratory research — where the goal is to discover which variables matter in a system whose structure is unknown — pre-registration is either impossible or so vague as to be meaningless. The multiple comparisons problem thus exposes a limit of classical frequentist inference: it has no satisfactory account of how to control error rates when the hypothesis space itself is unknown.