Academic researcher William J. Bowman published a blog post in early March 2026 proposing a three-factor framework for rigorously evaluating when generative models and LLM-based agents deliver genuine utility, as opposed to what he calls "vibes." Bowman, a self-described generative model skeptic, argues that current discourse around AI agent productivity is fundamentally unscientific, with proponents making sweeping claims that cannot withstand empirical scrutiny. His framework centers on three interdependent variables: relative encoding cost (the effort of writing a prompt versus producing the artifact directly yourself), relative verification cost (how hard it is to validate generated output versus human-produced output), and artifact versus process dependence (whether the value of a task lies in the deliverable or in the act of creation itself).

The verification-cost insight is the most counterintuitive part of the framework. Bowman argues that as models improve and generate increasingly plausible-but-wrong output, verification becomes harder, not easier. This dynamic undermines the naive assumption that capability gains translate directly into productivity gains. His model predicts that generative tools deliver net-positive value only when all three conditions hold: encoding is cheap, verification is trivial, and the artifact rather than the process is what matters. For <a href="/news/2026-03-14-grief-and-the-ai-split-how-ai-coding-tools-are-exposing-a-long-hidden-developer-divide">complex, semantically dense, or process-driven work</a> such as research software design, he argues, models are often counterproductive, because the act of writing or designing is itself where knowledge and conceptual clarity are generated.
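
To make the prediction concrete, here is a minimal sketch in Python. It is not from Bowman's post: the names (`Task`, `encoding_cost`, `verification_cost`, `process_dependent`, `likely_net_positive`) and the 0.5 cutoffs are illustrative assumptions standing in for his three factors.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Field names and scales are illustrative, not Bowman's notation.
    encoding_cost: float      # effort to prompt the model / effort to produce the artifact yourself
    verification_cost: float  # effort to validate generated output / effort to validate human output
    process_dependent: bool   # True if the value lies in the act of creation, not the deliverable

def likely_net_positive(task: Task) -> bool:
    """One reading of the framework's prediction: generative tools pay off only
    when prompting is much cheaper than doing, checking is cheap, and only the
    artifact matters. The 0.5 cutoffs are arbitrary placeholders."""
    return (
        task.encoding_cost < 0.5
        and task.verification_cost < 0.5
        and not task.process_dependent
    )

# Cryptic dependency error: cheap to ask about, and a passing build verifies the fix.
print(likely_net_positive(Task(0.1, 0.2, process_dependent=False)))  # True

# Research software design: writing the code is where conceptual clarity emerges.
print(likely_net_positive(Task(0.4, 0.9, process_dependent=True)))   # False
```

The conjunction is the point: a single expensive factor, most often verification, is enough to flip the prediction and erase the apparent gains.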

The post gained traction on Hacker News, where commenters broadly validated the framework from direct experience with AI coding agents. One commenter described a verification-cost trap when using autonomous AI agents to rewrite codebases: generation was passive and boring, while review became confusing and frustrating, breaking the mental model a developer otherwise maintains during fine-grained, synchronous AI assistance. Another noted that for research software, writing the code is how conceptual clarity emerges, and that AI tools can therefore accelerate travel down the wrong path. A dissenting commenter offered a counterpoint that is nonetheless consistent with Bowman's model: LLMs provide genuine, high-value utility for low-complexity, high-friction tasks such as resolving cryptic dependency errors, precisely the low-encoding-cost, low-verification-cost scenario the framework predicts should benefit most.

The framework is well-timed: organizations are <a href="/news/2026-03-14-perplexity-launches-personal-computer-ai-agent-platform-for-enterprise">deploying generative models broadly</a>, often without systematic task analysis. Bowman's post does not escape its own limitations, however: he invokes empirical research refuting subjective productivity claims without citing specific studies, which weakens the post's standing as a citable scientific argument. Relevant empirical work exists and aligns with his claims, yet goes unattributed: a 2023 study of GitHub Copilot's impact on developer productivity by Peng et al., and an NBER working paper by Brynjolfsson, Li, and Raymond measuring objective productivity impacts on customer support agents. Even so, the three-factor model gives practitioners a concrete mental model for evaluating specific agent deployments, and the reception suggests an appetite for a more disciplined conversation about where AI agents actually create value.