Toby Ord has raised a question that almost nobody in AI is asking: are the costs of running frontier AI agents growing even faster than their capabilities? We know from METR benchmarks that AI agents can handle tasks of exponentially increasing length, jumping from seconds to hours of human-equivalent work over seven years. But Ord points out that model parameters grew 4,000x and token generation grew 100,000x in that same period. The bill for peak performance might be spiraling upward just as fast as the benchmarks are climbing.
His proposed metric is straightforward: divide the cost of completing a task at a model's 50% success threshold by the human-equivalent hours that task takes. Claude Opus 4.1, for example, has a 2-hour time horizon, so take the cost of running such a task, divide by two, and you get an hourly rate. Ord found that opinions on where these costs are heading vary wildly. Some assume total task costs stay flat, meaning hourly rates plummet; others assume costs are climbing exponentially too. Most people he asked couldn't even ballpark what an AI agent costs per hour of software engineering work. Is it cents? Dollars? Hundreds?
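As a sketch, the metric reduces to a single division. The dollar figure below is hypothetical, chosen purely to show the shape of the calculation; Ord's point is that almost nobody knows the real number.

```python
def hourly_rate(task_cost_usd: float, time_horizon_hours: float) -> float:
    """Ord's metric: cost of completing a task at the model's 50% success
    threshold, divided by the human-equivalent hours that task represents."""
    return task_cost_usd / time_horizon_hours

# Hypothetical: if a task at the 2-hour time horizon costs $9 in API
# usage to attempt, the effective rate is $4.50 per human-equivalent hour.
print(f"${hourly_rate(9.00, 2.0):.2f}/hour")
```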
METR's own cost-performance charts for models like GPT-5 reveal a clear pattern. Unlike humans, whose cost scales linearly with task length, AI agents hit diminishing returns: each model's curve plateaus as you pour in more compute. Ord annotates these charts with constant hourly-cost lines, finding that each model has a "sweet spot" where it achieves its cheapest hourly rate before returns fade. The real worry is what happens if the sweet spot keeps getting more expensive with each generation. Then top-tier models become like Formula 1 cars: proof of what's technically possible, but nothing you'd actually deploy for real work.
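To make the sweet-spot idea concrete, here is a minimal sketch using made-up data points, not METR's actual numbers. Each pair is (human-equivalent hours, total cost to hit the 50% threshold) for one hypothetical model: fixed overhead makes short tasks expensive per hour, diminishing returns make long tasks expensive per hour, and the minimum sits in between.

```python
# Synthetic (hours, cost_usd) points along one model's cost-performance curve.
curve = [(0.25, 1.00), (0.5, 1.20), (1.0, 1.80), (2.0, 4.00), (4.0, 14.00)]

# Hourly rate at each point; the sweet spot is the cheapest one.
rates = [(hours, cost / hours) for hours, cost in curve]
best_hours, best_rate = min(rates, key=lambda point: point[1])
print(f"Sweet spot: ${best_rate:.2f}/hr at the {best_hours}-hour task length")
```

The question Ord raises is whether `best_rate` falls, holds, or climbs from one model generation to the next; nobody is currently publishing that series.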
There are reasons for cautious optimism. Mixture-of-Experts architectures, used in Mixtral and reportedly in GPT-4, activate only a fraction of their parameters per token, breaking the link between model size and inference cost. Quantization shrinks weights from 16-bit floats to 4-bit integers. Hardware like NVIDIA's H100 and Google's TPU v5p pushes more throughput per dollar through lower-precision arithmetic. These optimizations matter because autoregressive inference is typically memory-bandwidth-bound, not compute-bound: generating each token means streaming the model's weights from memory, so smaller weights mean faster, cheaper tokens. As GPU prices climb, such optimizations may be the only thing keeping costs in check. Still, Ord's core question deserves a real answer, and right now nobody has one. Until we track hourly costs alongside time horizons, we're guessing about whether AI agents are getting cheaper or just being propped up by bigger compute budgets.
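A back-of-envelope sketch of that memory-bandwidth argument, under stated assumptions: the 70B parameter count is a hypothetical dense model, and the bandwidth figure is the published ~3.35 TB/s of an H100 SXM. In the bandwidth-bound regime, a rough throughput ceiling is bandwidth divided by the bytes of weights streamed per token.

```python
PARAMS = 70e9         # hypothetical 70B-parameter dense model
BANDWIDTH = 3.35e12   # H100 SXM HBM3 bandwidth, ~3.35 TB/s

for name, bytes_per_weight in [("16-bit", 2.0), ("4-bit", 0.5)]:
    model_bytes = PARAMS * bytes_per_weight
    # Rough ceiling at batch size 1: every token streams the full weight set.
    # (A 140 GB model would be sharded across GPUs in practice; the 4x ratio
    # between the two precisions is the point of the exercise.)
    tokens_per_sec = BANDWIDTH / model_bytes
    print(f"{name}: {model_bytes / 1e9:.0f} GB of weights, "
          f"~{tokens_per_sec:.0f} tokens/sec ceiling")
```

Quartering the bytes per weight roughly quadruples the token-throughput ceiling, which is why quantization shows up directly in cost per hour, not just in memory footprint.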