Autonoma, an AI-powered QA testing platform, is scrapping 18 months of production code to rebuild from scratch. Co-founder Tom Piaggio detailed the decision in a post published March 10, 2026, citing two compounding factors: a deliberate "no-tests, non-strict TypeScript" engineering culture that worked for a two-person team but caused cascading instability as headcount grew to 14, and a fundamental shift in what modern LLMs can do without elaborate scaffolding. The company had raised funding, closed numerous clients, and was growing operationally when Piaggio made the call.

The technical rationale centers on model capability improvements since Autonoma's original build. When the platform was first developed during the GPT-4 (pre-4o) era, the team had to construct sophisticated Playwright and Appium wrappers — including seven self-healing click strategies — to make agentic QA work reliably. Piaggio argues those guardrails are now unnecessary and that inheriting the associated technical debt would provide little benefit. The rewrite also involves a deliberate stack change: Next.js and Server Actions are being dropped in favor of React with tRPC and TanStack Start on the frontend, and a Hono backend. Piaggio's post is unusually candid about Server Actions' flaws, citing sequential global execution, poor testability, lack of dependency injection, and inadequate Sentry observability as fundamental issues rather than edge cases. Workflow orchestration is moving to Argo on Kubernetes; both Temporal and useworkflow.dev were evaluated and rejected as incompatible with Autonoma's stateful mobile and web job model.
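Piaggio's "seven self-healing click strategies" describe a fallback chain: when one locator strategy fails to land a click, the wrapper tries the next. A minimal sketch of that pattern — the strategy names and the `selfHealingClick` helper are illustrative, not Autonoma's actual code:

```typescript
// Illustrative fallback chain for a "self-healing" click wrapper.
// Each strategy attempts to locate and click a target; the wrapper
// walks the chain until one succeeds. Strategy names are hypothetical —
// Autonoma's actual seven strategies are not public.
type ClickStrategy = {
  name: string;
  attempt: (target: string) => boolean; // true if the click landed
};

function selfHealingClick(target: string, chain: ClickStrategy[]): string {
  for (const strategy of chain) {
    if (strategy.attempt(target)) return strategy.name;
  }
  throw new Error(`all ${chain.length} strategies failed for "${target}"`);
}

// Example chain: exact selector first, then progressively fuzzier fallbacks.
const chain: ClickStrategy[] = [
  { name: "css-selector", attempt: (t) => t.startsWith("#") },
  { name: "data-testid", attempt: (t) => t.includes("testid") },
  { name: "visible-text", attempt: (t) => t.trim().length > 0 },
];
```

In the real wrappers, each `attempt` would drive Playwright or Appium and the chain would add waits, retries, and screenshot checks; the point is that every layer exists to paper over model and selector fragility that, per Piaggio, current models no longer exhibit.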

Autonoma isn't the first agentic team to discover that its early scaffolding has become the problem. German AI testing startup Octomind documented a similar inflection point in June 2024, removing LangChain entirely from its production stack after finding that the framework's abstractions — designed to compensate for weaker 2022-era models — actively prevented the multi-agent patterns that more capable models now support natively. Researchers have documented this pattern across multiple teams: guardrails that compensated for weaker models can actively degrade performance with more capable ones. The open-source tool Shortest, released after Anthropic launched computer use in October 2024, offers the starkest illustration: tests written in plain English are executed by Claude 3.5 Sonnet reading the screen visually, with no DOM-parsing logic or selector strategies required — achieving in days what Autonoma spent 18 months building elaborate infrastructure to approximate. <a href="/news/2026-03-14-spec-driven-verification-claude-code-agents">Specification-driven verification approaches</a> for autonomous systems exemplify this same shift toward direct LLM-powered execution over complex infrastructure.
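Shortest's approach inverts the wrapper model: the test itself is just a natural-language string, and interpretation is delegated to a vision-capable model. A minimal sketch of that interface, with the model stubbed out — everything below is illustrative, not Shortest's actual API; the real tool drives a browser and streams screenshots to Claude:

```typescript
// Sketch: a plain-English test is a string plus an executor that asks
// a vision-capable model to carry it out. No selectors, no DOM parsing.
// The executor here is a stub standing in for the model call.
interface VisionExecutor {
  run(instruction: string): Promise<{ passed: boolean; reason: string }>;
}

async function runPlainEnglishTest(
  instruction: string,
  executor: VisionExecutor
): Promise<boolean> {
  const { passed, reason } = await executor.run(instruction);
  if (!passed) console.error(`FAIL: ${instruction} (${reason})`);
  return passed;
}

// Stub executor: accepts any non-empty instruction.
const stubExecutor: VisionExecutor = {
  async run(instruction) {
    return { passed: instruction.trim().length > 0, reason: "stubbed" };
  },
};
```

The contrast with the fallback-chain style is the point: all the locator logic Autonoma hand-built collapses into whatever the model infers from the screen at run time.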

The Hacker News discussion around Piaggio's post surfaced real skepticism. Simon Willison argued that tests are precisely what enable fast shipping by providing safety nets for new features — a direct rebuttal of the anti-testing philosophy Piaggio himself now repudiates. Others invoked the classic "second system" risk: rewrites tend to become slower, over-engineered, and less loved than the scrappy original, and starting over does not by itself fix the organizational habits that produced the first codebase's problems. Piaggio's post explicitly acknowledges the culture failure — an admission that goes further than most founders will put on the record. Willison's counter-argument is worth holding onto: the second system needs tests too.