Someone pointed Andrej Karpathy's autonomous research loop at a CPU and got a 92% performance improvement in under ten hours. The auto-arch-tournament project, created by FeSens, treats CPU microarchitecture optimization like a genetic algorithm where an LLM acts as the mutation engine. It proposes changes to a 5-stage RV32IM core in SystemVerilog, implements them in isolated git worktrees, and runs the results through formal verification, cosimulation, and physical synthesis on a Gowin FPGA. After 73 hypotheses, only 10 stuck. The final design hits 2.91 CoreMark/MHz at 199 MHz, beating VexRiscv's human-tuned baseline by 56% on iter/s with 40% fewer LUTs.
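The shape of that loop — propose a mutation, verify it, keep it only if it survives — can be sketched in a few lines. Everything below is illustrative: the function names, the stub verifier, and the scoring are stand-ins, not the project's actual code; only the propose/verify/keep structure and the "most hypotheses fail" behavior come from the article.

```python
import random

def propose_mutation(rng):
    """Stand-in for the LLM proposing an RTL change (here: a random delta)."""
    return rng.gauss(0, 1)

def verify(delta):
    """Stand-in for the gating stack (formal, cosim, synthesis).
    A mutation 'passes' only if it is a genuine improvement; most
    random proposals fail, like the real 63 of 73."""
    return delta > 0.5

def tournament(n_hypotheses=73, seed=0):
    rng = random.Random(seed)
    score, accepted = 0.0, 0
    for _ in range(n_hypotheses):
        delta = propose_mutation(rng)
        if verify(delta):      # only verified mutations land on main
            score += delta
            accepted += 1
        # rejected hypotheses (worktrees) are simply discarded
    return score, accepted

score, accepted = tournament()
print(f"accepted {accepted}/73, cumulative gain {score:.2f}")
```

The point of the structure is that the LLM never commits anything directly; the verifier is the only path to the main branch.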

The verifier does the real work here. Of those 73 hypotheses, 63 were wrong: they broke the ISA, failed timing, or regressed performance. The eval stack (riscv-formal for 53 symbolic checks, Verilator cosimulation against a Python ISS, nextpnr place-and-route, CoreMark CRC validation) is what turns an LLM's stochastic ideas into reliable progress. Without rigorous validation, you're just generating broken RTL at scale. As the author puts it, whatever moat you think you have on the agent loop lasts about six months; the verifier is the thing nobody gets paid to build.
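A gate like that is naturally a short-circuiting pipeline: run each check in order, and the first failure rejects the hypothesis with a reason. A minimal sketch, assuming the stage order from the article; the `gate` helper, the candidate dict, and the 190 MHz timing threshold are hypothetical stand-ins for what the real flow would measure.

```python
def gate(stages):
    """Build a verifier from an ordered list of (name, predicate) stages."""
    def run(candidate):
        for name, check in stages:
            if not check(candidate):
                return False, name   # first failing stage rejects
        return True, "accepted"
    return run

# Toy stand-ins: a "candidate" is a dict of results a real flow would produce.
stages = [
    ("riscv-formal", lambda c: c["formal_ok"]),        # symbolic ISA checks
    ("cosim",        lambda c: c["cosim_ok"]),         # Verilator vs. Python ISS
    ("pnr-timing",   lambda c: c["fmax_mhz"] >= 190),  # nextpnr timing closure
    ("coremark-crc", lambda c: c["crc_ok"]),           # benchmark output CRC
]
verify = gate(stages)

ok, why = verify({"formal_ok": True, "cosim_ok": True,
                  "fmax_mhz": 150, "crc_ok": True})
print(ok, why)  # this candidate fails timing, like many of the 63 rejects
```

Ordering the cheap symbolic checks before place-and-route keeps most rejections fast, which matters when 63 of 73 hypotheses are going to die somewhere in the pipe.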

This differs from commercial EDA tools like Synopsys DSO.ai and Cadence Cerebrus, which use reinforcement learning to optimize power, performance, and area during physical implementation of a fixed design. auto-arch-tournament operates higher up, at the architectural level, mutating the RTL itself. That's harder to verify but opens more headroom. The loop's third iteration pulled DIV/REM out of the single-cycle path and accidentally halved the LUT count. The LLM didn't predict that; it just tried something and watched what the synthesizer did.