Anthropic engineers have published findings showing that infrastructure configuration alone can shift scores on agentic coding benchmarks by 6 or more percentage points—a margin that frequently exceeds the gap between top frontier models on public leaderboards. Running Terminal-Bench 2.0 on a Google Kubernetes Engine cluster, the team discovered that their scores diverged from those on the benchmark's official leaderboard because their Kubernetes setup treated per-task resource specifications as both a guaranteed floor and a hard ceiling. Any momentary memory spike would OOM-kill a container that might otherwise have succeeded, with up to 6% of tasks failing due to pod errors entirely unrelated to model capability. Terminal-Bench's official leaderboard, by contrast, uses a sandboxing provider that permits temporary overallocation, providing implicit headroom that suppresses such failures.
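In Kubernetes terms, treating a resource specification as both floor and ceiling corresponds to pinning a container's requests equal to its limits. The sketch below illustrates that failure mode; the pod name, image, and values are hypothetical, not Anthropic's actual configuration:

```yaml
# Hypothetical per-task pod spec: requests pinned equal to limits.
# The scheduler guarantees 2Gi (the floor), but any spike past 2Gi
# triggers an immediate OOM-kill (the ceiling) -- zero headroom.
apiVersion: v1
kind: Pod
metadata:
  name: terminal-bench-task    # illustrative name
spec:
  containers:
    - name: agent-sandbox
      image: task-image:latest # illustrative image
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
        limits:
          memory: "2Gi"        # equal to the request: a momentary spike is fatal
          cpu: "1"
```

With memory requests equal to limits, Kubernetes assigns the pod the Guaranteed QoS class, which trades burst tolerance for scheduling predictability—exactly the trade-off the article describes.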

To quantify the effect, Anthropic held all other variables constant—same Claude model, same harness, same task set—and ran Terminal-Bench 2.0 across six resource configurations ranging from strict 1x enforcement to fully uncapped. Infrastructure error rates dropped monotonically from 5.8% at 1x to 0.5% when uncapped. The relationship between headroom and success scores proved nonlinear: between 1x and roughly 3x allocation, success rates fluctuated within noise (p=0.40), but above 3x they climbed faster than infra errors declined, as generous allocations unlocked solution strategies requiring large dependencies, expensive subprocesses, or memory-intensive test suites. The total lift from 1x to uncapped was 6 percentage points (p < 0.01). A crossover experiment on <a href="/news/2026-03-14-metr-research-half-of-swe-bench-passing-ai-prs-rejected-by-maintainers">SWE-bench</a> showed the same directional effect, though smaller in magnitude at 1.54 percentage points between 1x and 5x RAM.
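Significance claims like the ones above come down to comparing two pass rates against sampling noise. A minimal sketch using a standard two-proportion z-test; the task counts and pass rates are assumed for illustration, since the writeup's per-configuration sample sizes are not stated here:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test: is the gap between two
    pass rates larger than sampling noise alone would explain?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative: a 6-point gap (56% vs 50%) over a hypothetical 500 tasks per run.
z, p = two_proportion_z(280, 500, 250, 500)
print(f"z = {z:.2f}, p = {p:.3f}")
```

One takeaway from the toy numbers: at 500 tasks per configuration, even a 6-point gap is near the edge of detectability, which is why both benchmark size and run counts matter when reading p-values like these.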

Because top frontier models from Anthropic, OpenAI, Google, and Meta are routinely separated by just 1–3 points on leaderboards, any gap that falls within that 6-point infrastructure margin may reflect evaluation substrate noise rather than genuine capability difference. <a href="/news/2026-03-15-ai-agents-scientific-labs-evaluation-gap">The findings</a> also expose an architectural asymmetry: agents that default to installing heavyweight toolchains like pandas and scikit-learn will rank higher under generous resource policies, while agents that implement lean, standard-library solutions will rank higher under tight ones. Different labs train their models toward different default coding behaviors, so a benchmark's resource policy can systematically favor one architectural philosophy over another without that bias being visible in the published score.
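The asymmetry can be made concrete with a toy model (all numbers hypothetical): the same five tasks, scored under a tight and a generous memory cap, for an agent style that reaches for heavyweight dependencies versus one that stays lean but cannot solve every task.

```python
# Toy model of the architectural asymmetry (all numbers hypothetical).
# Each entry is the peak memory (GiB) a solution needs; None means the
# lean, standard-library approach cannot solve that task at all.
HEAVY_PEAK_GIB = [3.5, 4.0, 2.5, 3.8, 1.2]   # heavyweight toolchain: solves all 5
LEAN_PEAK_GIB  = [0.8, 0.9, None, 0.7, 0.6]  # lean agent: fails task 3

def pass_rate(peaks, cap_gib):
    """Fraction of tasks solved without exceeding the memory cap."""
    solved = sum(1 for p in peaks if p is not None and p <= cap_gib)
    return solved / len(peaks)

for cap in (2.0, 8.0):  # tight vs generous cap
    print(f"cap={cap}Gi  heavy={pass_rate(HEAVY_PEAK_GIB, cap):.0%}"
          f"  lean={pass_rate(LEAN_PEAK_GIB, cap):.0%}")
```

Under the 2 GiB cap the lean agent wins (80% vs 20%); under the 8 GiB cap the heavyweight agent wins (100% vs 80%). The ranking flips purely on resource policy, with no change in either agent.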

Anthropic's recommendation is to treat resource configuration as a first-class experimental variable. Benchmarks should specify both a guaranteed allocation and a separate hard kill threshold per task—rather than a single pinned value—and evaluation reports should disclose enforcement methodology alongside scores. Resource allocation is not the only hidden variable: time limits, cluster health, hardware specs, concurrency levels, and API latency variance by time of day can all skew results. The full writeup is available at anthropic.com/engineering/infrastructure-noise.
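In Kubernetes, the recommended split maps directly onto distinct requests and limits values. A sketch with illustrative numbers, not a prescribed configuration:

```yaml
# Sketch of the recommended policy: a guaranteed floor (requests)
# decoupled from a much higher kill threshold (limits), so transient
# spikes burst into slack capacity instead of OOM-killing the task.
resources:
  requests:
    memory: "2Gi"   # guaranteed allocation the scheduler reserves
    cpu: "1"
  limits:
    memory: "8Gi"   # hard kill threshold, ~4x headroom (illustrative)
    cpu: "4"
```

Per the disclosure recommendation, both values—and how strictly the limit is enforced—would be reported alongside the benchmark score.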