Golden Sets: Regression Engineering for Probabilistic AI Systems

Heavy Thought Laboratories published a technical doctrine piece on March 12, 2026, authored by Ryan Setter, arguing that AI and LLM workflows require a more disciplined approach to regression testing than most teams currently apply. The article introduces "golden sets" — curated, versioned collections of test cases paired with explicit scoring rubrics, acceptance thresholds, and outcome class definitions — as the foundational unit of pre-release quality control for probabilistic systems. Setter's central argument is that AI systems are uniquely capable of producing regressions that "sound plausible," where a prompt change might quietly degrade refusal behavior while improving answer quality in other areas, or a model upgrade might appear more capable while becoming less reliable under policy constraints.

The piece situates golden sets within a broader architectural pattern Setter calls "Probabilistic Core / Deterministic Shell," where the shell enforces behavioral constraints and the golden set verifies those constraints survived whatever change was just made. Rather than enumerate change surfaces as a list, Setter walks through each as a distinct failure scenario: prompt changes, model upgrades, retrieval quality shifts, validator and tool contract drift, and write-gating actions he calls "Two-Key Writes." Each golden set case is specified in a vendor-neutral JSON format carrying an input payload, constraints, expected outcome class, must-include and must-not-include assertions, a pinned rubric version, and change-surface metadata tags. Setter explicitly dismisses common substitutes — demo prompts, vague spreadsheets, and off-the-shelf benchmark numbers — as insufficient, and argues that single-number aggregate quality scores are "mostly decorative" given that useful gates must be multi-metric and tied to specific failure classes.

The article's reception on Hacker News was dominated by a sharp irony: multiple commenters, led by user mpalmer, argued that the article itself exhibits the hallmarks of AI-generated prose. Critics pointed to monotone sentence structure, excessive uniform bullet-point formatting, and repetitive restatement of the same points — mpalmer counted seven near-identical constructions following the pattern "[X] is not [Y]. It is [Z]." The observation that an article advocating rigorous quality gates for AI-generated outputs appears to be an unedited AI output became the dominant thread of discussion, overshadowing the technical substance. One commenter noted that the opening lines showed genuine engagement, suggesting that whatever original voice existed in the piece was largely overwritten during generation — leaving those seven parallel constructions standing as an unintentional live demo of exactly the monotone drift that a well-specified golden set would have caught before publish.