Where do tokens actually go when agents build software? Researchers Mohamad Salim, Jasmine Latendresse, SayedHassan Khatoonabadi and Emad Shihab instrumented the ChatDev multi-agent framework running a GPT-5 reasoning model across 30 development tasks, mapping its internal phases onto the classic software lifecycle: design, coding, completion, review, testing and documentation.

The result inverts the intuition that generation is the expensive part. The iterative code review stage consumed an average of 59.4% of all tokens, and input tokens made up 53.9% of overall consumption, which the authors read as empirical evidence of significant inefficiency in how agents pass context to one another. The primary cost of agentic software engineering, they conclude, lies not in writing code but in automated refinement and verification.
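To make the two headline metrics concrete, here is a minimal sketch of how a per-phase token share and an overall input share could be computed from a usage log. The log format, the function names and every number in it are hypothetical placeholders chosen for easy arithmetic; they are not the paper's data or the authors' instrumentation.

```python
from collections import defaultdict

# Hypothetical per-call usage records: (phase, input_tokens, output_tokens).
# These figures are illustrative placeholders, not the paper's measurements.
USAGE_LOG = [
    ("design", 1_000, 500),
    ("coding", 2_000, 1_500),
    ("review", 4_000, 2_000),
    ("review", 3_000, 1_000),
    ("testing", 1_500, 500),
    ("documentation", 2_000, 1_000),
]


def phase_shares(log):
    """Return each phase's share of total tokens (input plus output)."""
    totals = defaultdict(int)
    for phase, tokens_in, tokens_out in log:
        totals[phase] += tokens_in + tokens_out
    grand_total = sum(totals.values())
    return {phase: count / grand_total for phase, count in totals.items()}


def input_share(log):
    """Return the fraction of all tokens that were input (prompt) tokens."""
    total_in = sum(tokens_in for _, tokens_in, _ in log)
    total_all = sum(tokens_in + tokens_out for _, tokens_in, tokens_out in log)
    return total_in / total_all


if __name__ == "__main__":
    for phase, share in phase_shares(USAGE_LOG).items():
        print(f"{phase}: {share:.1%}")
    print(f"input share: {input_share(USAGE_LOG):.1%}")
```

Because review is iterative, it appears as several calls in the log, and each call re-sends context as input. That is the mechanism by which a single phase can come to dominate the total in a breakdown like this.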

For anyone pricing agent workloads, the 59.4% review share is a rare concrete planning number. For framework authors, it points the optimisation target squarely at collaboration protocols and context handling rather than smarter first drafts.
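As a rough illustration of how such a number might feed a budget, the sketch below splits a token total into input and output using the reported 53.9% input share and applies per-million-token prices. The function name, the token total and both prices are assumptions made up for the example, and an average measured on one framework and 30 tasks will not necessarily carry over to another workload.

```python
def estimate_cost(total_tokens, input_share, price_in_per_million, price_out_per_million):
    """Estimate spend by splitting a token total into input and output tokens.

    Prices are per million tokens. All arguments are caller-supplied
    assumptions; nothing here comes from a real price list.
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (
        input_tokens * price_in_per_million
        + output_tokens * price_out_per_million
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10M tokens, placeholder prices of 1.00 and 10.00
    # per million input and output tokens respectively.
    cost = estimate_cost(10_000_000, 0.539, 1.00, 10.00)
    print(f"estimated cost: {cost:.2f}")
```

When output tokens are priced well above input tokens, as in these placeholder prices, a majority of tokens being input does not mean a majority of spend is input, so both the split and the price ratio matter when planning.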