Computer scientist William J. Bowman published a direct challenge last week to how the AI industry evaluates generative model utility — not via ethics or economics, but with a three-variable analytical framework he argues should replace gut feeling entirely.
The post, titled "Against Vibes: When is a Generative Model Useful," was published March 5 on his personal website (williamjbowman.com). Bowman has been skeptical of generative models since their rise, and his complaint here is narrow but pointed: engineers who would never pick an algorithm on instinct routinely accept sweeping AI claims — "software engineering is dead" being his go-to example — with no underlying model to support them.
His three variables are encoding cost, verification cost, and what he calls artifact versus process dependency. Encoding cost asks how expensive it is to describe the task in a prompt relative to just producing the output yourself. Verification cost asks whether you can actually confirm the output is correct — and crucially, whether doing so demands the same domain expertise the model was supposed to replace in the first place. The third variable is the one most easily overlooked: does the task's value lie in the finished output, or in the act of doing the work? A researcher building a compiler from scratch isn't just after a working binary. The construction is the contribution.
Bowman runs the framework against two of his own experiences, and the contrast is blunt. Eight hours with Claude Opus attempting to generate a Haskell DSL interpreter: failure. The task had high encoding complexity, verification that demanded deep expertise, and significant process dependency, since the researcher's intellectual contribution was the act of building the interpreter itself. A single-line agent prompt to identify and install a forgotten package: immediate success. Low encoding cost, trivially checkable output, nothing lost by not doing it manually.
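The framework lends itself to a compact sketch. The following Python is my own illustration, not code from Bowman's post: the field names and the simple all-three-must-pass decision rule are assumptions layered on his three variables, with his two case studies plugged in as described.

```python
from dataclasses import dataclass

# Hypothetical encoding of Bowman's three variables. The names and the
# conjunctive decision rule are illustrative framing, not his code.
@dataclass
class Task:
    name: str
    encoding_cost_low: bool      # cheap to describe in a prompt vs. doing it yourself?
    verification_cost_low: bool  # checkable without the expertise the model replaces?
    artifact_only: bool          # value in the finished output, not the act of doing it?

def model_likely_useful(task: Task) -> bool:
    # A task scores well only when all three dimensions favor delegation.
    return (task.encoding_cost_low
            and task.verification_cost_low
            and task.artifact_only)

# Bowman's two experiences, as characterized in the post:
dsl_interpreter = Task("Haskell DSL interpreter", False, False, False)
missing_package = Task("identify and install forgotten package", True, True, True)

print(model_likely_useful(dsl_interpreter))  # → False
print(model_likely_useful(missing_package))  # → True
```

The conjunction is the point: one bad dimension is enough to sink a use case, which is why the interpreter task fails despite the model's raw capability.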
For readers tracking the agent space, what's notable is what the framework implies about scope. If Bowman is right, the tasks where models consistently succeed — low complexity, easy to verify, finished output is all that matters — tend to be the lowest-stakes work. Difficult, high-value problems score poorly on all three dimensions simultaneously. He brackets ethical and economic concerns explicitly (describing investment at current scale as "almost criminal fiduciary negligence" before setting the question aside) to keep the argument technical — but the technical argument alone is inconvenient for a lot of current agent marketing.
The framework has real limits. It's built on two personal case studies, which is a thin empirical base. It doesn't account for how these variables shift as models improve, and it doesn't help much with tasks that sit messily across all three dimensions at once. What it does offer is a vocabulary: a way to ask whether a claimed use case is actually a good one, rather than just a compelling demo. In a field where evaluation frameworks are scarce and hype is not, that's worth having.