The strictest benchmark for real-world coding quality — METR's merge-rate methodology — has never been applied to any frontier AI model released after Claude Sonnet 4.5. That gap has been easy to overlook amid a steady stream of model releases. A new statistical analysis suggests it deserves more attention.

Independent researcher kqr, writing on entropicthoughts.com, has reanalyzed merge rate data originally collected by METR — the AI safety and evaluation organization — using leave-one-out cross-validation to test which mathematical model best fits the trend over time. Three candidates were evaluated: a linear growth trend (Brier score 0.0129), a piecewise step function (0.0117), and a constant — no change at all (0.0100). Lower Brier scores indicate better predictive fit. The flat line won.
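To make the comparison concrete, here is a minimal sketch of the kind of leave-one-out procedure described above, under stated assumptions: the data are synthetic binary merge outcomes (not METR's actual dataset), and the two models shown are a constant (mean) predictor and a simple linear-trend predictor; the actual analysis also included a piecewise step function.

```python
import numpy as np

# Synthetic stand-in for merge outcomes: 1 = maintainer-mergeable.
# Timestamps are fractional years; the true success rate here is flat,
# purely for illustration -- this is NOT METR's data.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(2025.0, 2026.0, 60))
y = rng.binomial(1, 0.35, size=t.shape)

def loo_brier(t, y, predict):
    """Leave-one-out Brier score: hold out each point, fit on the rest,
    and average the squared error of the predicted probability."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        p = np.clip(predict(t[mask], y[mask], t[i]), 0.0, 1.0)
        errs.append((p - y[i]) ** 2)
    return float(np.mean(errs))

def constant_model(t_train, y_train, t_new):
    # "No change at all": predict the overall base rate.
    return y_train.mean()

def linear_model(t_train, y_train, t_new):
    # Linear trend in time, fit by least squares.
    slope, intercept = np.polyfit(t_train, y_train, 1)
    return slope * t_new + intercept

print("constant LOO Brier:", loo_brier(t, y, constant_model))
print("linear   LOO Brier:", loo_brier(t, y, linear_model))
```

The lower score wins; on data with no underlying trend, the constant model typically matches or beats the linear one because the fitted slope is mostly noise.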

The analysis rests on the difference between METR's two success criteria. "Passes all tests" measures whether automated tests accept the generated code. "Maintainer-mergeable" asks whether a human engineer would actually approve it. The distinction matters: under the stricter standard, the task duration at which LLMs succeed half the time drops from 50 minutes to just 8. Automated benchmarks, in other words, have been overstating real-world capability by roughly a factor of six in time-horizon terms.
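The "task duration at which LLMs succeed half the time" is a 50% time horizon: fit a curve of success probability against task length and solve for where it crosses one half. A minimal sketch, using hypothetical (duration, success-rate) pairs rather than METR's data, and a simple logit-least-squares fit in log-duration:

```python
import numpy as np

# Hypothetical success rates by task duration (minutes); illustrative only.
durations = np.array([1, 2, 4, 8, 15, 30, 60, 120], dtype=float)
success_rate = np.array([0.95, 0.9, 0.8, 0.5, 0.35, 0.2, 0.1, 0.05])

# Fit p = 1 / (1 + exp(-(a + b*log t))) by least squares on the logit.
logit = np.log(success_rate / (1 - success_rate))
b, a = np.polyfit(np.log(durations), logit, 1)

# The 50% horizon is where a + b*log(t) = 0, i.e. t = exp(-a/b).
horizon = np.exp(-a / b)
print(f"50% time horizon: {horizon:.1f} minutes")
```

With these made-up numbers the horizon lands near 8 minutes; swapping in a different success criterion shifts the whole curve, which is how the same tasks can yield a 50-minute horizon under one standard and an 8-minute horizon under another.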

kqr's conclusion is direct: LLMs have not improved in their programming abilities for over a year. The dataset spans activity since early 2025, a period during which Anthropic, Google, OpenAI, and others each released models accompanied by claims of coding improvements. The post notes that similar capability claims circulated throughout 2025 — and the merge-rate data, for the models where it exists, did not support them.

What remains unresolved is the question of the most recent releases. kqr's postscript acknowledges that Anthropic and Google have both claimed step-change capability gains in late 2025 and early 2026 — but METR has not run its merge-rate evaluation on any of those models. Without that data, there is no rigorous basis on which to assess whether the latest frontier releases have actually moved the needle on the metric that matters most to practicing engineers.