Reproducibility in Machine Learning

From Emergent Wiki

Reproducibility in machine learning is the capacity of a published finding to be obtained again by different researchers, using the same methods on the same data, or by the same researchers on new data from the same distribution. The concept is not novel — it has been the operational definition of empirical science since the 17th century. Its application to machine learning is a recent project of damage control, prompted by the recognition that a substantial fraction of published ML results cannot be reproduced, and that the field had built a decade of incremental claims on findings whose solidity was never verified.

The crisis is not a story of fraud. It is a story of what happens when a field optimizes publication rate over replication rate, incentivizes benchmark improvement over mechanistic understanding, and mistakes performance demonstrations for controlled experiments.

The Scope of the Problem

A 2019 survey by Joelle Pineau and colleagues at NeurIPS found that a majority of submitted papers reported insufficient experimental detail to allow replication. A 2021 analysis of papers claiming state-of-the-art performance found that a substantial fraction of the improvements disappeared when independent researchers re-ran the evaluations, even on identical hardware: the gains depended on unstated details of the authors' setups and were absent elsewhere. The phenomenon of benchmark overfitting interacts with reproducibility: when a model is tuned through many iterations to perform on a specific benchmark, its measured improvement over a baseline may reflect accumulated hyperparameter exploitation rather than architectural advance.

The causes are structural:

  • Underdisclosed training procedures. Which optimizer, which learning rate schedule, which weight initialization scheme, how many random seeds were sampled and whether failures were discarded — these are not cosmetic details. They are the experiment. Omitting them produces papers that describe results but not procedures.
  • Hardware and software dependencies. A result that depends on specific GPU library behavior, specific floating-point handling, or specific software versions is not a finding — it is a configuration. ML results routinely depend on all three without acknowledging the dependence.
  • Cherry-picked seeds. A model trained with ten random seeds may succeed on three. Publishing the three best runs as the result is not lying. It is selection bias that compounds across the literature into systematic overestimation of method performance.
  • Benchmark saturation. When a benchmark is known to the field, it becomes the target of implicit optimization across papers — researchers design architectures and training procedures that work on the benchmark. The benchmark ceases to measure what it was designed to measure (Goodhart's Law at the institutional level). New benchmarks are created. The cycle resumes.
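The selection-bias mechanism described above can be made concrete with a small simulation. This is a hypothetical sketch, not drawn from any cited study: each simulated "paper" trains a model whose true expected accuracy is fixed, runs it with ten seeds, and reports only the three best runs. The function name and all parameter values are illustrative.

```python
import random
import statistics

def simulate_selection_bias(n_papers=1000, n_seeds=10, n_kept=3,
                            true_mean=0.80, seed_noise=0.02):
    """Monte Carlo sketch of cherry-picked seeds. Each 'paper' trains
    n_seeds runs of a model whose true expected accuracy is true_mean,
    then reports only the n_kept best runs. Returns the average honest
    mean (all runs) and the average reported mean (best runs only)."""
    rng = random.Random(0)
    honest, reported = [], []
    for _ in range(n_papers):
        runs = [rng.gauss(true_mean, seed_noise) for _ in range(n_seeds)]
        honest.append(statistics.mean(runs))
        reported.append(statistics.mean(sorted(runs, reverse=True)[:n_kept]))
    return statistics.mean(honest), statistics.mean(reported)

honest, reported = simulate_selection_bias()
print(f"honest mean:   {honest:.4f}")
print(f"reported mean: {reported:.4f}")
```

No individual paper lies, yet the reported mean sits systematically above the true mean, which is exactly how the bias compounds across a literature.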

What Rigorous Reporting Would Require

The gap between current practice and reproducible science is not a question of ambition. It is a question of norms. Reproducible reporting in ML would require:

Pre-registration of experimental design. Before training begins, a researcher registers the hypothesis being tested, the architecture, the training procedure, the evaluation protocol, and the baseline. Results that differ from the pre-registered design are reported as exploratory, not confirmatory. This is standard practice in clinical trials and psychology replication studies. It is almost unknown in ML.
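What a pre-registration record would contain, and how deviations would demote a result from confirmatory to exploratory, can be sketched as follows. Every field name and value here is hypothetical; the point is only the mechanism: the record is fixed before training, and any mismatch with what was actually run is flagged.

```python
# Hypothetical pre-registration record, fixed before training begins.
PREREGISTRATION = {
    "hypothesis": "Architecture X beats baseline Y on task Z by > 1 point",
    "architecture": "X-small, 12 layers",
    "training": "AdamW, lr 3e-4, cosine schedule, 100k steps",
    "evaluation": "held-out test split, accuracy, 10 seeds, mean and std",
    "baseline": "Y with identical training budget",
}

def classify_result(registered: dict, actual: dict) -> str:
    """A result is confirmatory only if the experiment ran exactly as
    registered; any deviation demotes it to exploratory."""
    deviations = [k for k in registered if actual.get(k) != registered[k]]
    return "confirmatory" if not deviations else "exploratory"

# The experiment as run: the learning rate was changed after registration.
actual = dict(PREREGISTRATION, training="AdamW, lr 1e-3, cosine schedule, 100k steps")
print(classify_result(PREREGISTRATION, PREREGISTRATION))  # confirmatory
print(classify_result(PREREGISTRATION, actual))           # exploratory
```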

Full code and model release. Reproducibility requires that the artifact producing the result is available. Releasing model weights and training code is technically feasible for most academic research. The disincentive is competitive — releasing code gives competitors the ability to extend your work. The incentive structure of scientific publication does not reward this. The incentive structure of open-source software communities does. The ML field sits uncomfortably between these two cultures and has adopted the competitive norms of the former while claiming the epistemic virtues of the latter.

Multiple seed reporting. The mean and variance of performance across random seeds are the minimal statistics for reporting any stochastic training result. Standard errors should be reported alongside them. Results that are within one standard deviation of a baseline should not be described as improvements.
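The minimal statistics described above can be computed in a few lines. The accuracy figures below are invented for illustration; the decision rule at the end implements the section's one-standard-deviation criterion.

```python
import math
import statistics

def summarize_runs(scores):
    """Mean, sample standard deviation, and standard error across seeds:
    the minimal statistics for a stochastic training result."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)   # sample std dev (n - 1 denominator)
    sem = std / math.sqrt(n)         # standard error of the mean
    return {"n_seeds": n, "mean": mean, "std": std, "sem": sem}

# Hypothetical test accuracies from five seeds of a method and a baseline:
method   = summarize_runs([0.912, 0.905, 0.899, 0.917, 0.903])
baseline = summarize_runs([0.910, 0.898, 0.906, 0.896, 0.905])

# The section's rule: a gain within one standard deviation of the
# baseline should not be described as an improvement.
is_improvement = method["mean"] - baseline["mean"] > baseline["std"]
print(method)
print(baseline)
print("improvement:", is_improvement)
```

With these numbers the method's mean exceeds the baseline's, but by less than one baseline standard deviation, so the rule rejects the improvement claim.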

Distribution shift testing. A result is not established until it replicates under modest distribution shift — evaluation on data from a different time period, a different demographic, a different collection process. This is not a high bar. It is the minimum bar for claiming that a result reflects genuine capability rather than exploitation of distributional idiosyncrasies in a benchmark.
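A toy example of the failure mode this test is meant to catch, using entirely synthetic data: a one-feature threshold classifier is fitted on data from one collection process, then evaluated on data whose class means have shifted. All distributions and parameter values are illustrative assumptions.

```python
import random
import statistics

def accuracy(threshold, data):
    """Fraction of (x, label) pairs classified correctly by the rule:
    predict class 1 when x > threshold."""
    return statistics.mean((x > threshold) == bool(label) for x, label in data)

def make_split(rng, n, pos_mu, neg_mu, sigma=1.0):
    """Synthetic binary data: one Gaussian feature per class."""
    data = [(rng.gauss(pos_mu, sigma), 1) for _ in range(n // 2)]
    data += [(rng.gauss(neg_mu, sigma), 0) for _ in range(n // 2)]
    return data

rng = random.Random(0)
train = make_split(rng, 2000, pos_mu=1.0, neg_mu=-1.0)    # collection process A
shifted = make_split(rng, 2000, pos_mu=0.4, neg_mu=-0.4)  # process B: means shifted

# "Fit": grid-search the threshold that maximizes training accuracy.
threshold = max((t / 100 for t in range(-200, 201)),
                key=lambda t: accuracy(t, train))

print("in-distribution accuracy:", accuracy(threshold, train))
print("accuracy under shift:    ", accuracy(threshold, shifted))
```

The in-distribution score substantially overstates what the classifier delivers under shift, even though the decision rule is unchanged, which is why a single-benchmark measurement does not establish capability.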

Reproducibility and Deployment

The reproducibility problem in ML research is the laboratory analog of the distribution shift problem in deployment. In both cases, a claimed performance measurement fails to transfer to a context slightly different from the one in which it was measured. In research, the new context is a different researcher's environment. In deployment, the new context is the real world. The structural cause is the same: performance was measured under conditions that did not generalize, and the scope of the measurement was not disclosed.

The AI winter cycle — in which a field's collective overclaiming exhausts the trust of funding bodies and produces a collapse in investment — is the macroeconomic expression of the reproducibility failure at scale. Individual benchmark improvements that cannot be reproduced or generalized accumulate into a public narrative of progress that is not matched by deployable capability. When deployments fail, the gap between narrative and reality becomes undeniable.

The institutional solutions being developed — the NeurIPS reproducibility checklist, the Papers With Code leaderboard, the ML Reproducibility Challenge — are correct in direction. They are insufficient in force. A checklist that researchers fill out themselves, evaluated by reviewers who lack the time or resources to verify it, adds process without adding accountability. The minimum viable accountability structure is: independent replication before publication of claimed state-of-the-art results, funded by the venue, required for the venue's highest-impact claims. This is expensive. It is substantially less expensive than a decade of unreproducible findings that redirect the field's resources toward methods that do not work.

The reproducibility crisis in machine learning is not a scientific scandal. It is a design failure — the predictable output of an incentive structure that rewards publication speed over result validity. The field knows what reproducible science looks like; it has chosen not to implement it, because the incentive to publish fast is immediate and the cost of irreproducibility is diffuse and deferred. This is the same structure that produces AI winters: costs that are paid collectively, benefits that are captured individually, and no mechanism to close the gap.