METR's research note, published March 10, 2026, tests a question the benchmark leaderboards don't answer: when AI-generated code passes SWE-bench Verified, would a real maintainer actually merge it? The team recruited four active maintainers from scikit-learn, Sphinx, and pytest and had them review 296 pull requests produced by agents running on Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. The automated grader scored those PRs about 24 percentage points higher, on average, than the maintainers did. Roughly half of all benchmark-passing submissions would have been rejected.
The researchers also gave maintainers 47 human-written PRs that had already been merged; those scored 68%, and model results were normalized against that figure. Comparing grader-based improvement to maintainer-based improvement across the models tested, which span mid-2024 through mid/late-2025, reveals a gap growing by about 9.6 percentage points per year. The reasons for rejection were practical rather than arcane: core functionality that didn't work, patches that broke unrelated parts of the codebase, and code that didn't follow each project's conventions. Automated test suites catch regressions; they don't catch those.
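The arithmetic behind these two figures can be sketched in a few lines. Only the 68% human baseline and the rough gap magnitudes come from the note; the per-model `(time, gap)` points below are hypothetical placeholders chosen to illustrate how a per-year trend would be fit, not METR's actual data.

```python
# Sketch of the normalization and trend computations described above.
# HYPOTHETICAL per-model data; only the 68% baseline is from the note.

HUMAN_BASELINE = 0.68  # maintainers' score on 47 already-merged human PRs

def normalize(maintainer_score: float) -> float:
    """Express a model's maintainer score relative to the human baseline."""
    return maintainer_score / HUMAN_BASELINE

# Illustrative grader-minus-maintainer gaps, with t = years since mid-2024.
points = [(0.0, 0.14), (0.5, 0.19), (1.0, 0.24), (1.25, 0.26)]

# Ordinary least-squares slope: how fast the gap grows per year.
n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_g = sum(g for _, g in points) / n
slope = sum((t - mean_t) * (g - mean_g) for t, g in points) / \
        sum((t - mean_t) ** 2 for t, _ in points)

print(f"gap growth: {slope:.3f} per year")  # roughly 0.1, i.e. ~10 pp/year
```

With placeholder points of this shape, the fitted slope lands near the ~9.6 percentage points per year the note reports; the point is only that the headline number is a linear trend over a handful of model releases, so it rests on few data points.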
METR is careful about what the data does and doesn't show. Every agent in the study got a single attempt, with no reviewer feedback to iterate on — that's not how developers actually move a PR through review. The researchers don't frame the gap as a hard ceiling; better prompting, stronger elicitation, and human review loops could recover a significant portion of it. Their core claim is narrower: a SWE-bench score isn't a reliable proxy for how useful an agent is in practice.
SWE-bench Verified became the standard coding benchmark because it offered something most evaluations don't: real tasks, runnable tests, and a reproducible number. Labs published leaderboard results as evidence of productivity gains, and the underlying capability improvements were genuine. What this study adds, building on an earlier METR analysis of just 18 tasks, is the most thorough examination of the question to date, and with it a clearer sense of how much the benchmark flatters. Passing tests and writing code a maintainer will merge are different things, and the distance between them is larger than the scores have implied.