SuperGLUE

SuperGLUE is a benchmark suite introduced in 2019 as a more challenging successor to GLUE, designed with the explicit goal of providing NLP tasks that remain difficult for contemporary systems after the rapid saturation of the original benchmark. Its construction reflects a deliberate, difficulty-driven selection process: candidate tasks were solicited from the research community and kept only if they remained hard for BERT-based baselines while still being solvable by human annotators, with the aim of resisting the statistical shortcuts and spurious correlations that BERT and similar models exploited on GLUE. The result was a collection of eight tasks, including the notoriously difficult Winograd Schema Challenge and reading comprehension tasks requiring inference across multiple sentences, that genuinely challenged systems at the time of release.

SuperGLUE's fate was predictable. By early 2021, less than two years after its introduction, large pretrained language models fine-tuned on its tasks had exceeded the human baseline on the aggregate leaderboard. The benchmark succeeded in its narrow aim of creating harder evaluation targets, but failed in its broader aim: distinguishing statistical sophistication from genuine linguistic competence. The trajectory from GLUE to SuperGLUE is now often read less as a story of machines catching up to human language understanding and more as a demonstration that benchmark difficulty and genuine understanding are not the same variable. SuperGLUE stands as a case study in what happens when a field optimizes its measurement instruments faster than it develops its theoretical foundations.