Moritz Hardt, Director at the Max Planck Institute for Intelligent Systems in Tübingen, has published an open-access book that may be the most rigorous theoretical treatment of machine learning benchmarks to date. "The Emerging Science of Machine Learning Benchmarks," available at mlbenchmarks.org, runs from the ImageNet era through today's LLM leaderboard wars. The central question: benchmarks, by classical statistical standards, should never have worked. Standard holdout theory demands that a test set be evaluated once and then sealed away, yet the ML community posted test sets freely online for millions of repeated evaluations. Hardt's project is to explain why, despite this, benchmarks became the primary engine of measurable progress in AI.
The theoretical thread traces to a collaboration Hardt joined at the Simons Institute for the Theory of Computing in Fall 2013, with researchers including Cynthia Dwork, Vitaly Feldman, and Avrim Blum, who, Hardt argues, was among the first to connect the framework of adaptive data analysis to ML benchmarks specifically. That framework explains how repeated benchmark use creates a feedback loop between models and test data: each evaluation leaks information about the test set back into model development, progressively undermining classical statistical guarantees. The phenomenon is related to Freedman's paradox and to the broader replication crisis in statistics. Hardt's answer to the benchmark paradox centers on social norms: because the community cared about model rankings rather than absolute accuracy scores, benchmarks delivered something reliable even when individual metrics did not replicate across datasets.
The LLM era complicates the picture sharply. Models now train on massive internet crawls, making it impossible to verify what test data a model has already seen. Hardt works through the consequences: multi-task benchmark instability, Goodhart's Law dynamics — where optimizing the measure breaks the measure — and performativity, the process by which benchmarks actively reshape the systems they are supposed to evaluate. Most unsettling is what happens when models surpass human evaluators, collapsing the ground-truth assumptions on which every benchmark ultimately depends.
That last problem has already escaped academia. MMLU scores appear in tech CEOs' shareholder presentations. When DeepSeek R1 matched OpenAI o1 on reasoning benchmarks, the reaction moved global stock indices — a concrete illustration of performativity operating at geopolitical scale. A hardcover edition from Princeton University Press is expected in 2026, and the book was already in use in graduate courses at Tübingen during Fall 2024 and Fall 2025. The deeper question Hardt leaves open is whether any <a href="/news/2026-03-14-metr-research-half-of-swe-bench-passing-ai-prs-rejected-by-maintainers">benchmark can survive the thing it was built to measure</a>.