Ryan Setter of Heavy Thought Laboratories has published a detailed engineering framework for preventing quality regressions in AI and LLM-powered systems. It centers on what he calls 'golden sets': curated test case collections that function as versioned, contractual evaluation artifacts rather than the vague spreadsheets and demo prompts that pass for evals at many AI teams. Published March 12, 2026, the piece argues that single-number quality scores are 'mostly decorative' and that real regression infrastructure requires multi-metric gates, explicit outcome classes, pinned rubric versions, and acceptance thresholds that actually determine whether a change ships to production.
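To make the multi-metric idea concrete, here is a minimal sketch of a release gate that ships a change only when every metric clears its own pinned threshold, rather than blending everything into one decorative score. The metric names and threshold values are illustrative assumptions, not taken from the article.

```python
# Hypothetical multi-metric release gate: a change ships only if every
# metric meets its pinned threshold -- no single blended quality score.
PINNED_THRESHOLDS = {            # illustrative metrics, not Setter's
    "exact_match": 0.90,
    "refusal_accuracy": 0.95,
    "hallucination_rate": 0.02,  # lower is better for this one
}

LOWER_IS_BETTER = {"hallucination_rate"}

def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ship?, list of failing metrics with details)."""
    failures = []
    for metric, threshold in PINNED_THRESHOLDS.items():
        value = scores[metric]
        ok = value <= threshold if metric in LOWER_IS_BETTER else value >= threshold
        if not ok:
            failures.append(f"{metric}={value} vs threshold {threshold}")
    return (not failures, failures)

ship, failing = gate(
    {"exact_match": 0.93, "refusal_accuracy": 0.91, "hallucination_rate": 0.01}
)
# refusal_accuracy misses its gate here, so the change does not ship,
# even though the other two metrics improved.
```

The point of the per-metric structure is that a regression on one axis cannot be bought back by an improvement on another, which is what a single averaged score quietly allows.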

The framework introduces six named change surfaces — prompt, model, retrieval, validators, tool contracts, and policy — each of which requires its own regression coverage. Setter provides a vendor-neutral JSON case schema as a reference implementation, with fields for must-include and must-not-include assertions, expected outcome classes (including refusal, fallback, and needs-human-review alongside plain success), and change-surface metadata that prevents retrieval regressions from being confused with prompt regressions. The explicit inclusion of outcome classes beyond simple correctness is notable: it directly addresses the failure mode where a system is rewarded for confidently producing the wrong answer.
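A golden-set case in the spirit of that schema might look like the following. The field names, the sample case, and the checking logic are illustrative assumptions, not Setter's published schema.

```python
import json

# Hypothetical golden-set case: field names mirror the ideas described
# (string assertions, outcome classes, change-surface metadata) but are
# not the article's exact schema.
CASE = json.loads("""
{
  "id": "case-0042",
  "change_surfaces": ["retrieval", "prompt"],
  "input": "What is our refund window for EU customers?",
  "expected_outcome_class": "success",
  "must_include": ["14 days"],
  "must_not_include": ["30 days", "no refunds"]
}
""")

def evaluate(case: dict, outcome_class: str, output_text: str) -> bool:
    """Pass only if the outcome class matches and every string assertion holds."""
    if outcome_class != case["expected_outcome_class"]:
        return False
    if any(s not in output_text for s in case["must_include"]):
        return False
    if any(s in output_text for s in case["must_not_include"]):
        return False
    return True
```

Because `expected_outcome_class` can be `refusal`, `fallback`, or `needs-human-review` as well as `success`, a confident wrong answer fails the case even when it superficially looks like a correct response.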

Golden sets are positioned as a companion to Setter's broader 'Probabilistic Core / Deterministic Shell' architectural pattern, in which the LLM is treated as an inherently non-deterministic component wrapped in deterministic contracts, validators, and policy enforcement. The golden set is the gate that proves those deterministic invariants survived any given change. Setter also advocates for incident-driven case growth, arguing that every serious production failure should generate a new golden set case, framing production as 'a generous test author' that reveals edge cases no internal team would have thought to construct.
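As a sketch of how the shell relates to the golden set: the deterministic wrapper below validates whatever the probabilistic core returns and downgrades invalid output to an explicit fallback class, which is precisely the kind of invariant a golden-set case can assert. The function names, the contract, and the stand-in core are hypothetical, not code from the article.

```python
import re

def probabilistic_core(prompt: str) -> str:
    """Stand-in for an LLM call -- nondeterministic in a real system."""
    return "Your order total is $42.50."

def deterministic_shell(prompt: str) -> dict:
    """Wrap the core in a deterministic contract: validate or fall back."""
    raw = probabilistic_core(prompt)
    # Contract: the answer must contain a well-formed dollar amount.
    if re.search(r"\$\d+\.\d{2}\b", raw):
        return {"outcome_class": "success", "text": raw}
    # Invalid core output never reaches the caller labeled as a success.
    return {"outcome_class": "fallback", "text": "Unable to verify total; escalating."}
```

A golden-set case for an incident where the core once emitted a malformed total would then pin `expected_outcome_class` to `fallback` for that input, proving the shell's invariant still holds after any prompt, model, or retrieval change.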

The approach matters most to teams running agentic systems — the kind doing write-gated actions, tool calls, or multi-step workflows where a silent regression means corrupted data or a broken customer interaction, not just a slightly worse answer. That said, building and maintaining a real golden set takes sustained discipline, and Setter's framework is thorough enough to feel like a significant commitment before you've written your first case. Teams already using evaluation platforms like Braintrust or LangSmith have some scaffolding to work from, but the harder parts Setter is describing — versioned rubrics, pinned thresholds, incident-driven case growth wired into CI gates — still sit largely outside what those tools handle out of the box. The framework is rigorous. The gap between understanding it and actually shipping it is another matter.