A statistical reanalysis of METR's SWE-Bench data, posted March 12 by blogger kqr (entropicthoughts.com), finds that AI-generated code quality — measured by whether a human maintainer would approve it for merging, not just whether it passes automated tests — has been flat since early 2025.

The underlying data comes from METR, an independent AI safety and evaluation organization that had flagged a significant gap between two success criteria. On SWE-Bench, models look far more capable when judged by automated tests than by whether a maintainer would actually merge the code: the 50% success horizon drops from 50 minutes to just 8 minutes under the stricter criterion. That gap alone reframes how to read most LLM coding benchmarks.
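
The horizon metric works roughly like this: fit a logistic curve of task success against log task length, then solve for the length at which predicted success crosses 50%. Below is a minimal sketch of that idea on synthetic data; the 50-minute "true" horizon, the sample size, and the success model are all illustrative assumptions, not METR's actual data or code.

```python
# Sketch of a "50% success horizon" estimate: fit a logistic curve of task
# success against log task length, then solve for the length at which the
# predicted success probability crosses 50%. All data here is SYNTHETIC.
import numpy as np

rng = np.random.default_rng(0)
n = 400
minutes = rng.uniform(1, 240, n)                  # human time-to-complete
true_horizon = 50.0                               # hypothetical 50% point
# Success probability falls off logistically with log task length.
p_true = 1 / (1 + np.exp(np.log2(minutes / true_horizon)))
success = rng.random(n) < p_true

x = np.log2(minutes)
xc = x - x.mean()                                 # center for stable fitting
a, b = 0.0, 0.0
for _ in range(5000):                             # plain gradient ascent on
    p = 1 / (1 + np.exp(-(a + b * xc)))           # the logistic log-likelihood
    a += 0.1 * np.mean(success - p)
    b += 0.1 * np.mean((success - p) * xc)

# p == 0.5 exactly where a + b * xc == 0, i.e. x == x.mean() - a / b.
horizon = 2 ** (x.mean() - a / b)
print(f"estimated 50% horizon: {horizon:.0f} minutes")
```

Under the strict merge criterion, the same fit is run on merge outcomes instead of test passes, which is what pulls the horizon from 50 minutes down to 8.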

Using leave-one-out cross-validation, kqr tested three candidate models against METR's merge rate data, scoring each with the Brier score (mean squared prediction error, where lower is better). A gentle upward slope, the trend METR itself suggested, scored 0.0129. A step function scored 0.0117. A flat constant predicting zero change across the entire period scored best at 0.0100. kqr's reading: a real capability jump happened in late 2024, then stopped. The merge rate since early 2025 is essentially a horizontal line.
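
The comparison method can be sketched in a few lines: hold out each data point in turn, fit each candidate model on the rest, predict the held-out point, and average the squared errors. The merge-rate series, the monthly index, and the step breakpoint below are all synthetic stand-ins, not METR's numbers, so the scores will not reproduce kqr's.

```python
# Minimal sketch of leave-one-out cross-validation with a Brier-style
# (mean squared error) score, comparing three model families. The
# merge-rate series here is SYNTHETIC illustration data.
import numpy as np

months = np.arange(12)                      # hypothetical monthly index
rates = np.array([0.05, 0.06, 0.05, 0.12,   # synthetic merge rates: a step
                  0.13, 0.12, 0.13, 0.12,   # up early in the series,
                  0.13, 0.12, 0.13, 0.12])  # then a plateau

def loo_score(fit_predict):
    """Average squared error when each point is predicted from the others."""
    errs = []
    for i in range(len(months)):
        mask = np.arange(len(months)) != i
        pred = fit_predict(months[mask], rates[mask], months[i])
        errs.append((pred - rates[i]) ** 2)
    return float(np.mean(errs))

def linear(x, y, x0):
    """Gentle slope: ordinary least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope * x0 + intercept

def constant(x, y, x0):
    """Flat model: predict the mean of the remaining points."""
    return float(np.mean(y))

def step(x, y, x0, breakpoint=3):
    """Step function: separate means before and after an assumed breakpoint."""
    after = x >= breakpoint
    return float(np.mean(y[after])) if x0 >= breakpoint else float(np.mean(y[~after]))

scores = {"linear": loo_score(linear),
          "constant": loo_score(constant),
          "step": loo_score(step)}
print(scores)  # lower is better
```

On this made-up series the step model wins, since the data was built with a step in it; kqr's point is that on METR's actual post-2025 data, the boring flat constant beat both alternatives.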

Reaction on Hacker News pushed back in useful ways. Commenter wongarsu argued the METR graph does show visible improvement across Claude Sonnet 3.7, Opus 4.0, and Sonnet 4.5, and that GPT-5 should be treated as an outlier — a single OpenAI data point that skews the overall picture. Commenter aerhardt offered a practitioner's view: agentic tools like <a href="/news/2026-03-14-emacs-vim-ai-terminal-native-advantage">Claude Code</a> and OpenAI's Codex have genuinely improved terminal-based workflows, but the underlying code still tends toward over-engineered control flow, unsound architecture, and superficial fixes. A third commenter pointed to a possible structural explanation for the plateau: frontier labs have shifted from raw parameter scaling toward agentic post-training, which may show up as product-level improvement without moving benchmark numbers.

kqr closes with a pointed observation: neither Anthropic nor Google has released METR-style merge rate evaluations for their newest models, leaving capability claims unverified against the one metric that showed a plateau.