Security researcher Kasra Rahjerdi built a deliberately vulnerable book-review app and spent $1,500 finding out which LLM agents could actually break it. The flaw belongs to a class he has seen in production: a hardened API sitting in front of a wide-open Firebase database.
GPT-5.5 led with 7 solves out of 10, though at $9.46 per success and a median of 260k tokens per run. DeepSeek V4 Pro solved 3 of 10 at $0.62 per solve, roughly 15 times cheaper per solve. Claude Sonnet 4.6 and Opus 4.8 each managed 2 of 10, with Opus getting close several times before late safety refusals ended those sessions. Gemini 3.1 Pro Preview scored zero because it refused almost immediately, which shows in its median of 9k tokens per run against 100k-plus for every other model.
Rahjerdi is clear it is "not a scientific eval," and about half the spend went on failed or test runs. Still, capability, cost per solve and refusal behaviour each rank the field differently, and that spread is the point.
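To make that spread concrete, here is a minimal Python sketch that re-ranks the models using only the figures quoted above. The dictionary layout and function names are illustrative and not taken from Rahjerdi's write-up. Cost per solve is left as `None` where the text gives no number, and the 10-run count for Gemini is an assumption based on the other models.

```python
# Figures as quoted in the text; None marks values the text does not give.
# Assumption: Gemini also had 10 runs, like the other models.
RESULTS = {
    "GPT-5.5": {"solves": 7, "runs": 10, "usd_per_solve": 9.46},
    "DeepSeek V4 Pro": {"solves": 3, "runs": 10, "usd_per_solve": 0.62},
    "Claude Sonnet 4.6": {"solves": 2, "runs": 10, "usd_per_solve": None},
    "Claude Opus 4.8": {"solves": 2, "runs": 10, "usd_per_solve": None},
    "Gemini 3.1 Pro Preview": {"solves": 0, "runs": 10, "usd_per_solve": None},
}


def solve_rate(model: str) -> float:
    """Fraction of runs in which the model broke the app."""
    row = RESULTS[model]
    return row["solves"] / row["runs"]


def rank_by_solve_rate() -> list[str]:
    """Models ordered from highest to lowest solve rate (ties keep input order)."""
    return sorted(RESULTS, key=lambda m: -solve_rate(m))


def rank_by_cost() -> list[str]:
    """Models with a quoted cost per solve, cheapest first."""
    priced = [m for m, row in RESULTS.items() if row["usd_per_solve"] is not None]
    return sorted(priced, key=lambda m: RESULTS[m]["usd_per_solve"])


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more one model costs per solve than another."""
    return RESULTS[expensive]["usd_per_solve"] / RESULTS[cheap]["usd_per_solve"]


if __name__ == "__main__":
    print("By solve rate:", rank_by_solve_rate())
    print("By cost per solve:", rank_by_cost())
    print("GPT-5.5 vs DeepSeek V4 Pro:", round(cost_ratio("GPT-5.5", "DeepSeek V4 Pro"), 1))
```

GPT-5.5 tops the capability ranking while DeepSeek V4 Pro tops the cost ranking, and the ratio of the two quoted costs comes out at about 15, matching the figure in the text.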