Evaluation Ecology

Evaluation ecology is the study of how evaluative institutions, methods, and incentives co-evolve with the systems they assess. Together they form an ecosystem whose health depends on the diversity and independence of its constituent evaluators. Just as a biological ecosystem collapses when monoculture replaces diversity, an evaluation ecology collapses when all evaluators use the same benchmarks, the same metrics, and the same peer review panels. On this view, the benchmark overfitting crisis in machine learning is not a technical failure of particular benchmarks but an ecological one: the evaluation ecosystem has been reduced to a monoculture of leaderboard optimization, and overfitting is the predictable pestilence that follows.

A healthy evaluation ecology requires multiple independent evaluators with different incentives, methods, and access to different data. The adaptive evaluation framework is one species in this ecology; adversarial auditing, user behavioral testing, and longitudinal deployment monitoring are others. The critical question for evaluation ecology is not whether any single evaluator is perfect, but whether the ecosystem as a whole maintains sufficient diversity to prevent the systematic blindness that occurs when every evaluator shares the same assumptions.
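
One way to make "sufficient diversity" concrete is to borrow a standard diversity index from ecology. The sketch below is illustrative only: it computes the effective number of evaluation methods (the exponential of Shannon entropy, also called the Hill number of order 1) over a list of evaluators. The function name and the method labels are hypothetical, and the choice of this index is an assumption rather than part of the framework described above. It counts method categories only and says nothing about whether the evaluators' incentives or data are actually independent.

```python
import math
from collections import Counter


def effective_number_of_methods(evaluator_methods):
    """Return exp(Shannon entropy) of the method shares.

    evaluator_methods: one label per evaluator naming the method it uses.
    A result of 1.0 means a monoculture; a result equal to the number of
    distinct methods means the methods are evenly represented.
    """
    counts = Counter(evaluator_methods)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no evaluators at all
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Hypothetical ecosystems, labeled only for illustration.
monoculture = ["leaderboard"] * 4
mixed = [
    "leaderboard",
    "adversarial_audit",
    "user_behavioral_test",
    "deployment_monitoring",
]

print(effective_number_of_methods(monoculture))  # monoculture: one effective method
print(effective_number_of_methods(mixed))        # close to four effective methods
```

A skewed ecosystem, for example three leaderboard evaluators and one auditor, scores between 1 and 2: the index penalizes nominal diversity that is dominated by a single method.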

The concept extends beyond machine learning to scientific peer review, educational assessment, and regulatory oversight. In each domain, concentrating evaluative power in a small number of institutions produces homogenized standards that miss the very failures those standards were designed to catch. A robust evaluation ecology is therefore distributed, competitive, and adversarial rather than centralized, cooperative, and consensus-seeking.

Evaluation ecology rests on the recognition that the evaluator is as much a system as the evaluated, and that the pathology of one is the pathology of the other.