Theorem, an AI safety and formal verification startup, has published lf-lean — a fully verified translation of all 1,276 theorems and definitions from the Logical Foundations textbook (originally written in the Rocq/Coq proof assistant) into the Lean ecosystem. The project, completed in collaboration with researchers from MARS, MATS, and METR, used frontier AI models including Claude 3.7 Sonnet and OpenAI's o3, with roughly 2 person-days of human effort. The team estimates the equivalent manual effort at around 2.75 person-years — a 350× speedup.
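To make the object of the exercise concrete, here is a toy sketch of the kind of artifact such a corpus contains: a Logical Foundations-style theorem (commutativity of natural-number addition) restated and proved in Lean 4. The example and the name `add_comm'` are chosen for illustration here and are not drawn from lf-lean itself.

```lean
-- Illustrative only: a Logical Foundations-style theorem restated
-- and proved in Lean 4, by induction on the first argument.
theorem add_comm' (n m : Nat) : n + m = m + n := by
  induction n with
  | zero      => simp
  | succ k ih => simp [Nat.succ_add, Nat.add_succ, ih]
```

Each such theorem either type-checks or it doesn't, which is what makes the corpus machine-verifiable rather than merely reviewed.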
The central technical contribution is what Theorem calls 'task-level specification generators,' prototyped in a tool called rocq-dove. Many software transformations — translation, refactoring, optimization — are semantics-preserving: if you define what correctness means for a transformation class once, you can verify every instance automatically. That shifts human oversight from O(n), where reviewers must check each AI-generated artifact, to O(1), where the specification methodology is approved once and verification runs at scale without further human input. It's a direct response to a real constraint in AI-assisted development: as AI code output grows, human review capacity doesn't.
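One way to picture the per-instance obligation a specification generator might emit (a hypothetical sketch, not rocq-dove's actual output): the mechanically imported statement and the idiomatic AI-written Lean restatement are required to be logically equivalent, so each translated theorem carries a machine-checkable proof that nothing was lost in translation. The names below are invented for this sketch.

```lean
-- Hypothetical sketch of a translation-correctness obligation.
-- `origAddComm` stands in for a statement mechanically imported from
-- Rocq; `transAddComm` for the idiomatic Lean restatement an AI wrote.
-- Neither name comes from the actual lf-lean corpus or rocq-dove.
def origAddComm  : Prop := ∀ n m : Nat, n + m = m + n
def transAddComm : Prop := ∀ n m : Nat, m + n = n + m

-- The per-instance check: the two statements are provably equivalent,
-- so verifying the translation reduces to type-checking this proof.
theorem translationFaithful : transAddComm ↔ origAddComm :=
  ⟨fun h n m => h m n, fun h n m => h m n⟩
```

The specification (statement equivalence) is defined once for the whole translation task; only proofs like `translationFaithful` vary per instance, and the proof checker, not a human, discharges them.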
Theorem also mapped lf-lean onto METR's 'time horizon' benchmark, which measures AI progress by the human-time length of tasks that agents can complete autonomously. When plotted alongside unverified software engineering results, the verified track advances more quickly, suggesting formal verification isn't simply lagging behind conventional AI coding benchmarks — it may be closing the gap faster. The paper frames this practically: in a world where AI generates more code than humans can meaningfully review, machine-proved correctness starts to look less like academic overhead and more like a scalability requirement.
Theorem has open-sourced both rocq-lean-import and rocq-dove, and the lf-lean corpus is explicitly offered as a benchmark dataset for future verified software engineering evaluations. The paper identifies scaling the specification-generator approach beyond textbook material — to production codebases with messier invariants and weaker formal specifications — as the next meaningful test of whether the technique holds.