Benchmark Overfitting

Benchmark overfitting (also called Goodharting benchmarks or benchmark gaming) occurs when a machine learning system or research program achieves high scores on a benchmark without acquiring the underlying capability the benchmark was designed to proxy. Once the benchmark becomes the target of optimization, it stops being a reliable measure of the intended property. This is the machine learning instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Benchmark overfitting is endemic to ML research: as each standard benchmark saturates, researchers create a harder one, and the cycle of targeting the new benchmark begins again. NLP has cycled through benchmarks (GLUE, SuperGLUE, BIG-bench, etc.) at an accelerating pace as models reached human-level scores without demonstrating the reasoning capabilities the benchmarks were intended to test. The AI winter pattern, in which overclaiming based on benchmark performance is followed by deployment failure, is the institutional manifestation of benchmark overfitting at scale. The solution, endorsed by many researchers but implemented by few, is to evaluate capabilities with distribution-shifted, adversarial, and open-ended tests that are not available to the training process.
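The held-out evaluation idea can be illustrated with a minimal Python sketch. Everything in it is a hypothetical toy chosen for illustration: the integer-addition task, the two stand-in "models", and the helper names accuracy and overfitting_gap do not come from any real benchmark or library. It compares accuracy on a public benchmark with accuracy on a distribution-shifted set that the system never saw.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy task: each example is ((a, b), a + b).
Example = Tuple[Tuple[int, int], int]
Model = Callable[[Tuple[int, int]], int]


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Fraction of examples the model answers exactly right."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


def overfitting_gap(model: Model,
                    public: Sequence[Example],
                    held_out: Sequence[Example]) -> float:
    """Public-benchmark accuracy minus held-out accuracy.

    A large positive gap suggests the score reflects the benchmark
    itself rather than the capability it was meant to proxy.
    """
    return accuracy(model, public) - accuracy(model, held_out)


# Public benchmark: small operands, assumed visible during development.
public_benchmark = [((a, b), a + b) for a in range(5) for b in range(5)]

# Held-out set: shifted operand range, assumed never seen in training.
held_out_shifted = [((a, b), a + b)
                    for a in range(100, 105) for b in range(100, 105)]

# A "model" that memorized the public answers (illustrative stand-in).
memorized = dict(public_benchmark)


def memorizing_model(x: Tuple[int, int]) -> int:
    return memorized.get(x, 0)  # falls back to 0 on unseen inputs


def adding_model(x: Tuple[int, int]) -> int:
    return x[0] + x[1]  # actually has the capability


gap_memorizer = overfitting_gap(memorizing_model,
                                public_benchmark, held_out_shifted)
gap_adder = overfitting_gap(adding_model,
                            public_benchmark, held_out_shifted)

print(f"memorizer gap: {gap_memorizer:.2f}")
print(f"adder gap:     {gap_adder:.2f}")
```

Both toy models score perfectly on the public benchmark, so only the held-out comparison tells them apart. In practice the held-out set must stay genuinely unavailable to the training process for this gap to mean anything, and a small gap on one shifted set does not by itself establish the capability.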