A/B Testing

From Emergent Wiki

A/B testing is a controlled experiment methodology in which two variants of a system — variant A (the control) and variant B (the treatment) — are randomly assigned to users or subjects, and their outcomes are compared using statistical inference. It is the empirical engine of modern product development, used by technology companies to evaluate everything from button colors to pricing models to machine learning algorithms.

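The statistical foundation of A/B testing is the law of large numbers combined with hypothesis testing. Random assignment ensures that the groups are comparable in expectation, so a difference in outcomes can be attributed to the treatment rather than to pre-existing differences between the groups. The large sample sizes typical of digital experiments reduce sampling error and give a test enough power to detect small effects; whether an observed difference could plausibly be due to chance is what the significance test quantifies. But the validity of A/B testing depends critically on assumptions that are frequently violated in practice: independent observations and the stable unit treatment value assumption (SUTVA), which requires that one unit's outcome is not affected by the treatment assigned to other units (no interference between groups).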
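As a minimal sketch of the hypothesis-testing step, the comparison of two conversion rates can be written as a pooled two-proportion z-test. The function name and the example counts are illustrative assumptions, not part of any particular experimentation platform, and the normal approximation is only reasonable when both groups are large and observations are independent.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    conv_a, conv_b: number of converting users in control and treatment.
    n_a, n_b: number of users assigned to control and treatment.
    Returns (z, p_value). Assumes independent observations and large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the z-test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100/1000 conversions in A, 150/1000 in B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here says only that the observed difference would be unlikely under the null hypothesis; it does not check the independence or no-interference assumptions the test relies on.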

The methodology has spread far beyond its origins in agriculture and medicine. In technology, A/B testing enables continuous experimentation at scale: a company can run thousands of experiments simultaneously, each affecting a small fraction of users, and aggregate the results into a pipeline of incremental improvement. This model of development — hypothesis, experiment, measure, deploy — has replaced intuition-based decision-making in many organizations.
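One common way to run many experiments at once, each on a small fraction of users, is deterministic hash-based assignment. The sketch below is an assumption about how such a scheme might look, not a description of any specific company's system; `assign_variant`, `exposure_pct`, and the experiment names are hypothetical.

```python
import hashlib
from typing import Optional


def assign_variant(user_id: str, experiment_id: str, exposure_pct: int = 10) -> Optional[str]:
    """Deterministically assign a user to 'A', 'B', or None (not enrolled).

    exposure_pct: percentage of users (0-100) enrolled in this experiment,
    split evenly between control and treatment.
    """
    # Including the experiment id in the hash input makes assignments
    # in different experiments effectively independent of each other.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    threshold = exposure_pct * 100  # number of buckets enrolled
    if bucket >= threshold:
        return None
    return "A" if bucket < threshold // 2 else "B"


# The same user always gets the same answer for a given experiment.
print(assign_variant("user-42", "checkout-button-color"))
print(assign_variant("user-42", "pricing-page-layout"))
```

Because the assignment is a pure function of the user and experiment identifiers, no assignment table needs to be stored, and a user sees a consistent variant across sessions.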

But A/B testing has limitations. It measures average effects, not individual effects; it requires stable environments, not rapidly changing ones; and it optimizes for local improvements, not systemic redesign. Even successful A/B tests typically produce incremental gains, often on the order of a few percent in the target metric; they do not produce paradigm shifts.
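To illustrate the gap between average and individual effects, the following toy sketch uses made-up per-user effects for two hypothetical segments. In a real experiment individual effects are not directly observable, since each user sees only one variant; the numbers exist purely to show how opposite effects can cancel in the average.

```python
from statistics import mean

# Hypothetical per-user treatment effects (treatment outcome minus control
# outcome). These values are invented for illustration only.
effects = {
    "new_users": [0.10, 0.12, 0.08, 0.10],
    "returning_users": [-0.10, -0.08, -0.12, -0.10],
}


def average_effect(effects_by_segment):
    """Average effect over all users, ignoring segments."""
    all_effects = [e for seg in effects_by_segment.values() for e in seg]
    return mean(all_effects)


def segment_effects(effects_by_segment):
    """Average effect within each segment."""
    return {name: mean(values) for name, values in effects_by_segment.items()}


print("overall:", average_effect(effects))
print("by segment:", segment_effects(effects))
```

In this constructed case the overall average is essentially zero even though every user is affected, which is the sense in which an A/B test reports an average rather than what happens to any individual.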

A/B testing is often celebrated as the triumph of data over intuition. But this framing conceals a deeper truth: A/B testing optimizes within a design space; it does not question the design space itself. The most dangerous assumption in A/B testing is not statistical; it is that the right question is being asked. A perfectly executed experiment that answers the wrong question is not science; it is bureaucracy with confidence intervals.