Qodo has published results claiming its AI code review tool beats Anthropic's Claude Code Review by 12 F1 points. The catch: the comparison was run on a benchmark Qodo itself designed.
The Qodo Code Review Benchmark 1.0, introduced in the company's research paper "Beyond Surface-Level Bugs: Benchmarking AI Code Review on Scale," works by injecting realistic defects into 100 merged pull requests drawn from production open-source repositories — 580 issues across 8 codebases and 7 languages including TypeScript, Python, Rust, and Swift. NVIDIA has since used the same benchmark to evaluate its Nemotron 3 Super model, which Qodo points to as evidence the methodology is gaining traction beyond its own walls.
For the headline comparison, both tools ran out of the box. Claude Code Review — which itself runs parallel agents, verifies findings, and posts inline GitHub comments — was set up with auto-generated AGENTS.md rules committed to each repository root, matching how a new customer would deploy it. On precision, the two systems were nearly even: when either flagged an issue, it was roughly as likely to be a genuine defect. The gap was in recall. Qodo's standard production setup (79% precision, 60% recall) already outpaced Claude. A more elaborate research configuration, using an orchestrated multi-agent harness, extended the lead to the full 12-point F1 margin without any drop in precision.
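For readers less familiar with the metric: F1 is the harmonic mean of precision and recall, so a recall gap drags the score down even when precision is level. A minimal sketch, using the 79%/60% figures reported for Qodo's standard setup (the function name `f1` is ours, not from the paper):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Qodo's reported standard-setup numbers: 79% precision, 60% recall.
score = f1(0.79, 0.60)
print(round(score, 3))  # -> 0.682
```

Because the harmonic mean is dominated by the smaller input, recall improvements are the cheapest way to move F1 once precision has plateaued — which is consistent with where Qodo says its lead comes from.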
Qodo attributes the recall advantage to two architectural bets. The first is specialization: instead of routing an entire pull request to a single agent, Qodo dispatches purpose-built agents for distinct problem types — logical errors, edge cases, best-practice violations, cross-file dependencies — then merges and deduplicates their outputs. The second is model diversity. Qodo pulls from OpenAI, Anthropic, and Google rather than staying within one provider's ecosystem, arguing that no single model family handles every analytical task equally well.
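Qodo's actual pipeline is not public, but the dispatch-merge-deduplicate pattern it describes is straightforward to sketch. Everything below — the `Finding` fields, the agent signatures, the file-and-line dedup key — is an illustrative assumption, not Qodo's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    category: str   # e.g. "logic", "edge-case", "best-practice", "cross-file"
    message: str

def run_specialized_agents(pr, agents):
    """Fan a pull request out to purpose-built agents, then merge results."""
    findings = [f for agent in agents for f in agent(pr)]
    # Deduplicate: if two specialists flag the same file/line, keep the first.
    seen, merged = set(), []
    for f in findings:
        key = (f.file, f.line)
        if key not in seen:
            seen.add(key)
            merged.append(f)
    return merged
```

The design trade-off is visible even in the sketch: running several narrow agents over the same diff raises recall (each specialist looks for one defect class), at the cost of a merge step to keep overlapping reports from inflating the comment count.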
On price, Qodo is blunt. Claude Code Review runs $15–$25 per review on a token-usage basis. Qodo claims its own cost is roughly an order of magnitude lower — a gap that compounds fast for teams running reviews at scale.
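The arithmetic behind "compounds fast" is simple to make concrete. The per-review prices below come from the article; the monthly review volume is an assumed figure for illustration:

```python
# Prices per review from the article; volume is a hypothetical team's.
claude_low, claude_high = 15, 25   # Claude Code Review, token-usage basis
qodo_per_review = claude_low / 10  # "roughly an order of magnitude lower" (vendor claim)
reviews_per_month = 500            # assumed volume

print(reviews_per_month * claude_low,   # -> 7500
      reviews_per_month * claude_high)  # -> 12500
print(reviews_per_month * qodo_per_review)  # -> 750.0
```

At that assumed volume the spread is four figures a month versus three — the kind of line item a platform team notices, which is presumably why Qodo leads with it.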
The obvious caveat hangs over all of it: Qodo wrote the test. The choices of which defect types to inject, which repositories to sample, and how to score outputs all came from the company with the most to gain from a favorable result. That doesn't make the findings wrong. But until an independent party runs the same evaluation, these numbers are best read as a well-constructed marketing claim rather than a neutral verdict.