Writing code was never the bottleneck in software development. A new opinion piece on antifound.com makes that case bluntly, arguing that the volume of LLM-generated code is a meaningless productivity metric, and that coding agents may be actively degrading software quality by pushing teams into implementation before design problems are properly understood.
The argument draws on foundational software engineering literature. The preface to Structure and Interpretation of Computer Programs (SICP) put it plainly: "programs must be written for people to read, and only incidentally for machines to execute." Decades of research confirm that lines of code (LOC) is a poor predictor of defects, effort, or elapsed time. None of that changed when the code started writing itself.
The piece identifies several concrete harms from over-indexing on generation volume. Agent harnesses push teams toward implementation before the design is adequately understood: models are trained to produce concrete output, so that is what they produce. They favor freshly generated code over existing, well-tested libraries, handing teams a maintenance bill for functionality that already existed. They also constrain planning to text and markdown, a poor substitute for the low-stakes iteration that whiteboards and sketches enable. The author explicitly dismisses "plan mode" as insufficient: even the planning workflows in current agent harnesses bias toward generating artifacts quickly rather than enabling disposable ideation. The economic logic is stark. Code is a high-fidelity prototype, and high-fidelity prototypes are expensive to change. AI tools are systematically accelerating the transition from cheap design phases to expensive implementation phases while leaving the upstream uncertainty intact.
Hacker News commentary broadly validated the thesis. One commenter [username pending editorial verification] reframed the dynamic in cost terms: codegen lowers the cost of writing code but leaves the <a href="/news/2026-03-14-agile-manifesto-ai-addendum-prioritizing-shared-understanding-over-shipping">cost of knowing what to write</a> unchanged. Teams now arrive at a full implementation faster while still holding a vague spec; the vagueness is not eliminated, only deferred to a more expensive point in the process. This maps directly onto Barry Boehm's cost-of-change curve from the 1970s, which established that the cost of fixing a defect rises exponentially the later it is found, from requirements through production. A self-described pro-AI writer [username pending editorial verification] offered a candid account of firsthand disillusionment: unnecessary complexity, an inability to convey business objectives to models even at high capability tiers, and eventually finding it faster to write and fix code manually.
Dominant vendor ROI narratives have centered on throughput: lines generated, pull requests merged, developer hours saved on keystroke-level tasks. GitHub Copilot's widely cited 2022 internal study claimed 55% faster task completion on an isolated coding exercise, a controlled benchmark that deliberately excluded the specification, design review, and maintenance phases. Cursor, Codeium, and similar tools have followed the same playbook. Standard benchmarks, including HumanEval, SWE-bench, and LiveCodeBench, do not meaningfully test how agents handle <a href="/news/2026-03-15-agent-context-data-governance">vague or incomplete specifications</a>, the exact failure mode this piece is most concerned about. Vendors maximize what is measurable at the implementation layer; the accumulated design debt and maintenance burden accrue to the customer after the sale.