Knuckledragger Brings Formal Verification to LLM-Generated RISC-V Assembly

Philip Zucker, a software engineer at Draper Laboratory in Cambridge, Massachusetts, published a detailed technical post on March 10, 2026 introducing a refined binary verification workflow built on Knuckledragger, his Python-based proof assistant. The system uses bisimulation to formally verify RISC-V and x86 assembly against higher-level specifications: users define a lifting function that maps low-level machine state — symbolically executed via pypcode, which wraps Ghidra's SLEIGH semantics — to a simpler compiler-IR-style model, then Z3's SMT solver automatically checks that executing at the low level and then lifting produces the same result as lifting first and then stepping through the high-level model. The square-commuting property, expressed as high_low ∘ step_low = step_high ∘ high_low, enables fully automated simulation proofs over all trace fragments between annotated entry and exit points in the assembly control-flow graph.

A RISC-V memcopy routine subjected to bounded model checking reveals a real off-by-one wraparound bug: when the length argument is zero, the loop counter underflows, a class of error Zucker notes is related to a bug found in an actual codebase. Fixing the routine with an initial zero-length guard eliminates the counterexample. A prior iteration of the work verified a ChatGPT-generated RISC-V array-summing routine against a recursive Z3 functional specification, directly illustrating the LLM-code-verification use case. The key improvement over earlier versions is that the new design avoids forcing users to work with noisy raw pcode symbolic executor output during interactive proof; instead, the bisimulation layer absorbs that complexity, and subsequent reasoning happens over the cleaner high-level model.

Zucker explicitly cites LLM-generated assembly as a primary motivation, arguing that bisimulation-based verification gives code-generating agents a formal mechanism to connect low-level output to a higher-level, more human-readable specification. Knuckledragger could serve as infrastructure for agent pipelines that need correctness guarantees beyond testing — particularly relevant as LLM-generated binary code enters security-sensitive contexts. The approach also applies to compiler validation and reverse engineering scenarios where source-level truth is unavailable. Knuckledragger's Python accessibility distinguishes it from the dominant binary verification ecosystems, which are built in Haskell (Galois's Crucible/What4/Macaw/SAW stack), OCaml (CMU's BAP), or Isabelle/HOL, and are largely funded by DARPA or defense contractors. The tradeoff is that Z3-backed proofs lack the kernel-checked soundness guarantees of full interactive theorem provers, and the pypcode dependency — last committed January 2025 with 209 GitHub stars — represents a maintenance uncertainty for the multi-architecture portability the system depends on. Knuckledragger itself currently has 159 GitHub stars and no stated external funding.