Abhishek Ray, author of the Claude Code Camp newsletter and an AI development trainer who has worked with over 100 engineers, published a detailed account on March 10 of a trust problem that emerges with autonomous coding agents. When tools like Gastown, an agent that runs overnight and ships code into unreviewed branches without supervision, bypass normal review cycles, standard quality controls break down. Ray argues that having Claude write its own tests produces what he calls a "self-congratulation machine": the AI verifies its own assumptions rather than user intent. And pairing a writer model with a reviewer model provides no real independence, since both share the same training-derived blind spots.

Ray built opslane/verify to address this: an open-source Claude Code plugin available through the plugin marketplace that runs a separate automated verification layer after an agent builds a feature. The approach draws from Test-Driven Development — write plain-English acceptance criteria before prompting the agent, then verify against them afterward. The tool runs in four stages: a pure-bash pre-flight check that validates server liveness and auth before spending any tokens; a single Claude Opus call acting as planner and spec interpreter; parallel Claude Sonnet instances each controlling a Playwright MCP browser agent to execute one acceptance criterion at a time, capturing screenshots and session recordings; and a final Opus judge call that reviews all evidence and returns per-criterion pass/fail verdicts. It requires no custom backend and works with an existing Claude OAuth token.
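The post describes the four stages but not the plugin's internals, so the following Python is only an illustrative skeleton of that flow, with all names hypothetical and the LLM and browser stages stubbed out. It shows the two parts whose logic the article does specify: a fail-fast pre-flight gate that runs before any tokens are spent, and a judge step that aggregates per-criterion evidence into pass/fail verdicts.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    criterion: str   # one plain-English acceptance criterion
    passed: bool     # judge's pass/fail call for this criterion
    evidence: str    # e.g. a screenshot path or session recording

def preflight(server_up: bool, auth_ok: bool) -> None:
    # Stage 1: validate liveness and auth before spending tokens.
    # (In opslane/verify this stage is pure bash, not Python.)
    if not (server_up and auth_ok):
        raise RuntimeError("pre-flight failed: fix server/auth first")

def judge(verdicts: list[Verdict]) -> bool:
    # Stage 4: the run passes only if every criterion passed.
    return all(v.passed for v in verdicts)

# Stages 2-3 (Opus planner, parallel Sonnet instances each driving a
# Playwright MCP browser) are API-driven and stubbed here; assume they
# produce one Verdict per acceptance criterion.
verdicts = [
    Verdict("login form rejects a bad password", True, "shots/01.png"),
    Verdict("dashboard loads after sign-in", False, "shots/02.png"),
]
print(judge(verdicts))  # → False: one criterion failed
```

The aggregation rule shown (all criteria must pass) is an assumption; the post only says the judge returns per-criterion verdicts.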

Community reaction on Hacker News was divided. Several commenters questioned whether stacking four LLM calls introduces more complexity and cost than the problem warrants; one framed unsupervised overnight agent runs as a practice engineers will reconsider as the field matures. Others argued that a simpler two-agent setup, one writer and one reviewer, combined with careful upfront spec work and real-time human observation already delivers substantial productivity gains without the orchestration overhead. A separate suggestion floated in the thread: instruct Claude to spawn red/green/refactor subagents in <a href="/news/2026-03-14-nanoclaw-partners-with-docker-for-hypervisor-level-agent-sandboxing">clean-room context isolation</a> as a lighter-weight alternative. Ray is candid about the pipeline's ceiling: it cannot catch cases where the original spec was wrong.

That last admission is the most honest thing in the post — and it exposes a real limit on what any verification layer can promise. The spec-first discipline is the genuinely useful idea here; the four-stage orchestration is the part the HN skeptics are right to press. Writing explicit acceptance criteria before prompting is sound engineering practice whether or not a team uses opslane/verify to automate the checks afterward. Whether they need Opus as planner, Sonnet running in parallel per criterion, and Opus again as judge — or whether one model tier against a well-written spec is sufficient — is a cost and complexity tradeoff, not a settled architectural question. Ray's pipeline is the most elaborate answer on the table. It may not need to be.