The AGENTS.md file fails by obedience, not neglect

A team at ETH Zurich and the startup LogicStar.ai ran the experiment the industry skipped. They took four coding agents, gave each one a repository three different ways, and counted how often each version actually closed out a real GitHub issue. The three versions: no context file at all, a context file the agent generated for itself, and the context file a human developer had already committed to the repo. The headline number is the one no vendor wanted to publish. The auto-generated AGENTS.md file, the thing every coding tool now tells you to produce with a single /init command, made the agents slightly worse. Success rates fell by roughly 3% on average, and the inference bill rose by more than 20%. The full preprint, Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?, landed in February.

The pitch

The case for context files is everywhere, and it isn't stupid. The AGENTS.md standard was formalised in August 2025 by OpenAI, Google, Cursor, Factory and Sourcegraph, and now sits under the Linux Foundation's Agentic AI Foundation alongside MCP. More than 60,000 public repositories carry one. Anthropic publishes a whole guide on writing CLAUDE.md files, OpenAI ships the /init command that writes one for you, and the promise is consistent across all of them: give the agent a short README written for machines, covering the directory layout, the build and test commands, the style conventions, and it will navigate your codebase faster and break your conventions less. The anecdotes back it up. Plenty of developers report that an agent got noticeably more competent the moment they added a context file. The practice spread because it felt true.

What it never had was a measurement. The files were too new to exist in old benchmarks, and the popular repos those benchmarks are built from aren't where most code actually lives. So the ETH group built a new one. AGENTBENCH is 138 issues drawn from 12 niche but active Python repositories that already carried developer-written context files, things like fastmcp, wagtail, tinygrad and the OpenAI Agents SDK, paired against the established SWE-bench Lite set of 300 tasks from popular projects. Then they ran Claude Code on Sonnet 4.5, Codex on GPT-5.2 and GPT-5.1 Mini, and Qwen Code on Qwen3-30B, each in its own harness, each repo in all three settings.

The read

The reason the files fail is more interesting than the fact that they do. The obvious story, and the one most of the coverage ran with, is that agents ignore bloated instructions or get confused by them. The traces say the opposite. The agents read the files and did exactly what they said. When a context file mentioned uv, the agent reached for uv about 1.6 times per task; when it wasn't mentioned, essentially never. Repository-specific tooling went from near-zero use to 2.5 calls per task once the file named it. Instruction-following was not the failure. Instruction-following was the mechanism of the failure.

Every line in a context file is a requirement the model now tries to satisfy, and each one costs steps and tokens. A generated AGENTS.md that lays out the directory tree, lists the testing conventions, asks for the full suite to be run and the style guide to be honoured turns a narrow bug-fix into a compliance exercise. With files present, the agents ran more tests, made more grep, read and write calls, explored more broadly, and on the reasoning-heavy models spent up to 22% more thinking tokens. The file made them more thorough and more expensive without making them more correct. A well-followed instruction you didn't need is still work the agent has to do.
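The study's two headline numbers, success down roughly 3% and inference cost up more than 20%, compound into a single figure: what you pay per issue actually resolved. A back-of-the-envelope sketch in Python, assuming the 3% is a relative drop against the no-file success rate (if it is percentage points the exact multiplier shifts, but not the direction) and treating 20% as the floor. The function name is mine, not the paper's.

```python
def cost_per_success_multiplier(success_change: float, cost_change: float) -> float:
    """Multiplier on cost per resolved issue, relative to the no-file baseline.

    Both arguments are relative changes: -0.03 means success fell 3%,
    0.20 means cost rose 20%.
    """
    return (1.0 + cost_change) / (1.0 + success_change)


# Headline figures from the study; reading the 3% as a relative drop is an assumption.
multiplier = cost_per_success_multiplier(success_change=-0.03, cost_change=0.20)
print(f"{multiplier:.2f}x")  # 1.24x
```

On those assumptions each closed issue costs about a quarter more than it did with no file at all.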

The cleanest demolition is the codebase overview, the single most-recommended section. Eight of the twelve human files and very nearly 100% of the Sonnet-generated ones included one. It did not help the agents find the relevant files any faster. On the metric the authors built for exactly this, steps before the agent first touches a file that the real fix changed, the map made no difference. For one model the files made it worse, because the agent kept issuing commands to locate and re-read context that was already sitting in its window. Modern agents are good at exploring a repo on their own. Handing them a tour they didn't ask for mostly gives them something extra to read.
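That metric is simple enough to sketch. What follows is an illustration in Python, not the authors' implementation: the trace format (one entry per agent step, listing the file paths that step touched), the function name and the example paths are all assumptions made for this example.

```python
from typing import Iterable, Optional, Sequence


def steps_to_first_relevant_file(
    trace: Sequence[Iterable[str]],
    fixed_files: Iterable[str],
) -> Optional[int]:
    """Return the 1-based step at which the agent first touches a file
    that the reference fix changed, or None if it never does.

    `trace` is a hypothetical format: one entry per agent step, each
    entry holding the file paths that step read or wrote.
    """
    targets = set(fixed_files)
    for step_number, touched in enumerate(trace, start=1):
        if targets.intersection(touched):
            return step_number
    return None


# Illustrative trace, not data from the study.
example_trace = [
    ["README.md"],
    ["AGENTS.md"],
    ["src/app/routes.py", "tests/test_routes.py"],
    ["src/app/models.py"],
]
print(steps_to_first_relevant_file(example_trace, ["src/app/models.py"]))  # 4
```

If a codebase overview worked as advertised, this number would drop when the file is present. The finding is that it did not.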

The strongest case against this

The defenders aren't fools, and the paper quietly vindicates a piece of their case. Buried in the ablations is the finding that rescues the anecdotes: when the researchers stripped a repository of all its other documentation (the README, the docs/ folder, the example code), the generated context file suddenly helped, by 2.7% on average, and beat the human-written file. So context files do work, in repos that have nothing else. That describes an enormous share of real code: the internal tool nobody documented, the half-finished side project, the 200-star library with a one-line README. The benchmark's "no file" baseline is usually a well-documented, popular project where the agent can just read the docs. Strip the docs and the file earns its keep. The people reporting that AGENTS.md transformed their workflow were, near as I can tell, mostly working in documentation deserts, and they were right about what they saw.

Two more fair points. The benchmark measures one thing: did the patch make the tests pass. It says nothing about whether the agent followed the house style, used the right commit format, stayed out of the migrations folder, or refrained from reformatting four hundred unrelated lines. A file that costs you 3% on a benchmark but stops the agent rewriting your CI config is plausibly worth it on a codebase you have to maintain, and the authors say as much. And the study is almost entirely Python, a language so over-represented in training data that the models already know its tooling cold. For a niche language or an unusual build system, the file might be the only thing between the agent and a wrong command.

All of that is true, and none of it saves the default. The thing the industry actually recommends is the universal version: tailor your agent to your repo and it will do better, run /init and commit what comes out. That is the claim the paper kills. The generated file lost on documented repos, lost across all four models, lost whether a stronger model wrote it, and lost whether the generator used Codex's prompt or Claude Code's. The redemption case is narrow and conditional on your repo being a documentation desert, which is an argument for writing docs, not for generating an AGENTS.md. The guardrail-and-style argument is the real survivor, and notice it points the same way the data does: the file that helps is short, specific, and made of things the agent cannot infer. Use uv, not pip. The test command is this. Don't touch /vendored. That's five lines a human writes with a real problem in mind, and it's the opposite of an auto-generated codebase tour.

The bigger pattern

This is about more than one Markdown file. The whole direction of agent tooling over the past year has been accumulation: memory files, rules files, more MCP servers, longer system prompts, more retrieved context shovelled into the window on the theory that more context yields a better agent. The AGENTS.md result is a clean data point against the unexamined form of that theory. Past some threshold the marginal instruction has negative returns. It isn't free to read, it isn't free to obey, and obedience to an unnecessary instruction looks, from the inside of the model, identical to doing useful work. Context has a cost curve and almost nobody is measuring where it bends. The discipline that wins here isn't context accumulation. It's context curation.

The bet

The authors handed over a clean test, so I'll use it. The load-bearing promise of the whole movement is that an automatically generated, repo-tailored context file beats no file at all on a normal, documented codebase. In this study, on documented repos, with four agents, multiple generator models and multiple prompts, it never once did. So the honest default today is to delete the file your /init command produced and replace it with the handful of lines only you know. If a vendor ships an /init that clears the no-file baseline on a well-documented benchmark repo, published and reproducible and not cherry-picked, I'll change my read the same day. Until then, the generated context file is mostly a tax you're paying for the feeling of having configured something.