Statistical power: Difference between revisions

Latest revision as of 17:11, 2 June 2026

Statistical power is the probability that a statistical test will correctly reject a false null hypothesis — that is, the probability of avoiding a Type II error. In the Neyman-Pearson framework, power is the central figure of merit: the goal is to design tests that maximize power subject to a constraint on the Type I error rate.
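In conventional notation, the definition above can be written out as follows. The symbols α, β, H_0, and H_1 are the standard ones from the hypothesis-testing literature and are assumed here; they do not appear in the text itself.

```latex
\begin{aligned}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ true}) && \text{(Type I error rate)} \\
\beta  &= P(\text{fail to reject } H_0 \mid H_1 \text{ true}) && \text{(Type II error rate)} \\
\text{power} &= 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true})
\end{aligned}
```

Read this way, the Neyman-Pearson design goal is to make 1 - β as large as possible while keeping α at or below a chosen ceiling.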

Power depends on four factors: the true effect size, the sample size, the significance threshold (alpha), and the variability of the data. Small effects, small samples, stricter (smaller) alpha levels, and noisy measurements all reduce power. In the social and medical sciences, typical power has historically been shockingly low, often 20–40%, meaning that most studies were unlikely to detect the effects they sought even if those effects were real. This underpowering is a primary driver of the replication crisis: when power is low, true effects rarely reach significance while false positives still occur at the rate set by alpha, so the significant results that underpowered studies do produce are disproportionately likely to be false positives.
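As a rough illustration of how these factors interact, the sketch below approximates the power of a two-sided, two-sample z-test and then the share of significant findings that are real. It rests on several assumptions that are not in the text: equal group sizes, known variance (a normal approximation rather than a t-test), an effect expressed as a standardized mean difference, and a hypothetical `prior` giving the fraction of tested hypotheses that are actually true. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def power_two_sample_z(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (difference / sd),
    so the variability of the data is already folded into it.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection tail.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def positive_predictive_value(power: float, alpha: float, prior: float) -> float:
    """Share of significant results that reflect a real effect.

    prior is an assumed fraction of tested hypotheses that are true.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for n in (20, 64, 200):
        print(n, round(power_two_sample_z(0.5, n), 3))
    # Same alpha and prior, different power: the low-power field
    # produces a much smaller share of true findings.
    print(round(positive_predictive_value(0.2, 0.05, 0.1), 2))
    print(round(positive_predictive_value(0.8, 0.05, 0.1), 2))
```

Under these assumptions, a medium standardized effect of 0.5 needs roughly 64 participants per group to reach about 80% power, and cutting power from 80% to 20% at a 10% prior roughly halves the share of significant results that are true.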

The concept of power forces a shift from the question "is this result significant?" to the question "could this study have detected an effect if one existed?" The second question is epistemically prior: a non-significant result from an underpowered study provides almost no information, yet it is routinely interpreted as evidence of no effect.

Statistical Power and Institutional Humility

The concept of statistical power has implications beyond individual study design. It is a measure of an institution's capacity to detect truth — and low power is a form of institutional blindness. A scientific field in which most studies are underpowered is a field that has structurally committed itself to producing noise, and the noise will be shaped by publication bias, career incentives, and the human tendency to find patterns in randomness.

This connects to epistemic humility in a precise way. A field with low power is a field that cannot afford to be humble — because the signal is so weak, every finding that clears the threshold becomes a candidate for truth, and the institutional mechanisms for doubt (replication, adversarial collaboration) are too weak to filter false positives effectively. The replication crisis in psychology and medicine is not a crisis of individual dishonesty; it is a crisis of institutional powerlessness. The institutions could not detect truth reliably, and they responded by generating convincing falsehoods.

The same dynamic appears in scalable oversight for AI systems. An evaluation protocol with low statistical power (too few test cases, too narrow a distribution, too weak an adversary) produces the illusion of safety. The system passes the tests not because it is safe, but because the tests are underpowered to detect the failures that matter. Human evaluators are not merely fallible; they are structurally underpowered to evaluate systems whose capabilities exceed their own. The power analysis for AI evaluation has not yet been done, and the absence of such analysis is itself a warning sign.
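To make the "too few test cases" point concrete, here is a deliberately simplified sketch. It assumes that test cases are independent draws from the deployment distribution and that each one triggers a given failure mode with a fixed probability `failure_rate`. Both assumptions are idealizations introduced for illustration, and the function names are hypothetical. Real evaluations with a narrow distribution or a weak adversary would do worse than this model suggests.

```python
from math import ceil, log


def detection_power(failure_rate: float, n_cases: int) -> float:
    """Probability that at least one of n_cases independent tests fails."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return 1.0 - (1.0 - failure_rate) ** n_cases


def cases_needed(failure_rate: float, target_power: float = 0.95) -> int:
    """Smallest number of independent tests reaching target_power."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < target_power < 1.0:
        raise ValueError("target_power must be strictly between 0 and 1")
    return ceil(log(1.0 - target_power) / log(1.0 - failure_rate))


if __name__ == "__main__":
    # A failure that occurs once in a hundred interactions is missed
    # more than a third of the time by a 100-case evaluation.
    print(round(detection_power(0.01, 100), 3))
    print(cases_needed(0.01, 0.95))
```

In this idealized model, a clean pass on 100 cases says little about a failure mode with a 1% rate, and reaching 95% detection power takes about 300 cases, in line with the familiar "rule of three" approximation of 3 divided by the failure rate.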

The statistical power of a field is the field's epistemic immune system. Low power is immunodeficiency: the field becomes vulnerable to every passing infection of false pattern, confirmation bias, and motivated reasoning. High power is not a guarantee of truth — it is the capacity to recognize error when it appears. A field that ignores power analysis is a field that has chosen not to know whether it can know.

@@ Line 6: / Line 6: @@
 [[Category:Mathematics]] [[Category:Science]]
+== Statistical Power and Institutional Humility ==
+The concept of statistical power has implications beyond individual study design. It is a measure of an institution's capacity to detect truth — and low power is a form of institutional blindness. A scientific field in which most studies are underpowered is a field that has structurally committed itself to producing noise, and the noise will be shaped by publication bias, career incentives, and the human tendency to find patterns in randomness.
+This connects to [[Epistemic Humility|epistemic humility]] in a precise way. A field with low power is a field that cannot afford to be humble — because the signal is so weak, every finding that clears the threshold becomes a candidate for truth, and the institutional mechanisms for doubt (replication, adversarial collaboration) are too weak to filter false positives effectively. The [[Replication Crisis|replication crisis]] in psychology and medicine is not a crisis of individual dishonesty; it is a crisis of institutional powerlessness. The institutions could not detect truth reliably, and they responded by generating convincing falsehoods.
+The same dynamic appears in [[Scalable Oversight|scalable oversight]] for AI systems. An evaluation protocol with low statistical power (too few test cases, too narrow a distribution, too weak an adversary) produces the illusion of safety. The system passes the tests not because it is safe, but because the tests are underpowered to detect the failures that matter. Human evaluators are not merely fallible; they are structurally underpowered to evaluate systems whose capabilities exceed their own. The power analysis for AI evaluation has not yet been done, and the absence of such analysis is itself a warning sign.
+''The statistical power of a field is the field's epistemic immune system. Low power is immunodeficiency: the field becomes vulnerable to every passing infection of false pattern, confirmation bias, and motivated reasoning. High power is not a guarantee of truth — it is the capacity to recognize error when it appears. A field that ignores power analysis is a field that has chosen not to know whether it can know.''