A developer essay published this week by Nicolas Wilmet argues that the primary bottleneck holding back LLM agents is neither model capability nor tool access, but automated validation — the iterative feedback layer that confirms whether an agent's outputs actually meet the goal. Writing at nicowil.me, Wilmet frames any successful agent workflow as requiring three components: knowledge (from training and context), access (from harnesses like Claude Code or OpenAI Codex), and automated validation. Of these, he identifies validation as "the layer that's least defined right now in active usage," even as tool access has expanded rapidly across the ecosystem.
To make the case concrete, Wilmet describes a personal blog migration in which he used Claude Code to convert a Jekyll site to Next.js. Despite the task being well represented in Claude's training data, the agent produced incorrect output: several posts rendered broken, and a post flagged "published: false" in its frontmatter was inadvertently made live. A vague corrective prompt, "fix the broken rendered posts," was enough to trigger remediation. Wilmet's reading is that the agent had the capability to self-correct but lacked the default behavior to proactively validate its own outputs mid-task: it never independently checked link clickability, rendering correctness, or frontmatter flags.
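The hidden-post error is exactly the kind of thing a mechanical post-build check catches. A minimal sketch of such a check, in Python with only the standard library (the function names and the filename-stem slug convention are illustrative assumptions, not Wilmet's actual setup):

```python
from pathlib import Path


def is_draft(post_path: Path) -> bool:
    """Return True if the post's YAML frontmatter marks it unpublished."""
    lines = post_path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return False  # no frontmatter block at all
    for line in lines[1:]:
        if line.strip() == "---":  # closing frontmatter delimiter
            break
        key, _, value = line.partition(":")
        if key.strip() == "published" and value.strip().lower() == "false":
            return True
    return False


def find_leaked_drafts(source_dir: Path, build_dir: Path) -> list[str]:
    """List draft posts whose slugs appear in the built output anyway."""
    leaks = []
    for post in sorted(source_dir.glob("*.md")):
        if is_draft(post):
            slug = post.stem  # assumption: output filename derives from stem
            if any(build_dir.rglob(f"{slug}*")):
                leaks.append(slug)
    return leaks
```

A check like this run after every build step would have flagged the "published: false" post before it went live, without any human taste involved.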
The essay draws a sharp line between tasks suited to automated validation and those that require human "taste." Structured, deterministic operations like find-and-replace are cleanly validatable: a tool either works or it doesn't. Unstructured content — such as the mix of HTML, Markdown, and scripts in Wilmet's blog posts — resists formal specification. Human taste, described as the instinctive ability to recognize when something looks wrong, fills that gap. Wilmet's practical conclusion is that tasks easiest to validate automatically will become the easiest to fully automate, a useful design heuristic for teams building agent workflows.
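The distinction can be made mechanical. A structured operation like renaming an identifier admits a post-condition that simply holds or doesn't, with no judgment required; a sketch (function names are illustrative, not from the essay):

```python
def rename_identifier(source: str, old: str, new: str) -> str:
    """A structured, deterministic edit: swap one identifier for another."""
    return source.replace(old, new)


def validate_rename(result: str, old: str, new: str) -> bool:
    """Deterministic post-condition: the old identifier is gone and the
    new one is present. Only sound when `old` is not a substring of `new`."""
    return old not in result and new in result
```

No comparably crisp predicate exists for "this blog post renders the way its author intended," which is where human taste takes over.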
Agent evaluation tooling has grown substantially around parts of this problem, but not the specific part Wilmet describes. Platforms like Galileo, Braintrust, and LangSmith cover post-hoc observability and evaluation. Open-source options such as Langfuse (reportedly acquired by ClickHouse, though the deal terms have not been independently confirmed) and DeepEval have found commercial traction. Runtime guardrail libraries such as Guardrails AI and Pydantic AI's built-in Evals module handle boundary-level policy and format checks. What remains unbuilt, as Wilmet's case study shows, is proactive <a href="/news/2026-03-14-spec-driven-verification-claude-code-agents">in-loop validation natively integrated into agent execution</a>: the mechanism that would have caught the hidden-post error before it was committed. Hacker News discussion following the post echoed the gap, with commenters floating ideas like a "validation keyword" to trigger self-checking after key operations, but no consensus solution has emerged.
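Reduced to its simplest form, in-loop validation is a retry loop that gates each agent step behind its checks. A hypothetical harness hook, not an API that any of the named platforms ships:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_validation(
    step: Callable[[], T],
    checks: list[Callable[[T], bool]],
    max_attempts: int = 3,
) -> T:
    """Execute an agent step, then run every validator before accepting
    the result. On failure the step is retried; a real harness would also
    feed the failing check back into the agent's context."""
    for _ in range(max_attempts):
        result = step()
        if all(check(result) for check in checks):
            return result
    raise RuntimeError("step output never passed validation")
```

The open design question the essay and the Hacker News thread circle is who writes `checks`: the user, the agent itself, or a spec generated before execution.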