When OpenAI quietly stopped reporting SWE-bench Verified scores, it was an admission of something the industry had been dancing around: public coding benchmarks are in bad shape. Frontier models can reproduce reference solutions from training data. Close to 60% of unsolved SWE-bench problems turn out to have broken tests. The leaderboard numbers had started to mean very little.
Cursor's parent company Anysphere decided to build its own measurement system instead. The result is CursorBench, an internal evaluation suite detailed in a post by researcher Naman Jain. Rather than pulling from curated GitHub issues or purpose-built puzzle tasks, CursorBench sources problems from actual developer sessions through a proprietary tool called Cursor Blame, which traces committed code back to the original agent request that produced it. The problems are real. The ground-truth solutions are real. That's the whole point.
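Cursor hasn't published how Cursor Blame works internally, but the core idea — joining per-line blame data against logs of agent sessions keyed by commit — can be sketched. Everything below is an assumption for illustration: the `AgentSession` record, the prompt-keyed grouping, and the input shapes are all hypothetical, not Cursor's actual schema.

```python
from dataclasses import dataclass


@dataclass
class AgentSession:
    """Hypothetical session log entry: the agent request behind a commit."""
    commit: str
    prompt: str


def trace_lines_to_prompts(
    line_commits: dict[int, str],
    sessions: list[AgentSession],
) -> dict[str, list[int]]:
    """Group file lines by the original agent request that produced them.

    line_commits maps a line number to the commit that last touched it
    (the shape of `git blame` output); sessions map commits to prompts.
    Lines from commits with no recorded agent session are dropped.
    """
    prompt_by_commit = {s.commit: s.prompt for s in sessions}
    grouped: dict[str, list[int]] = {}
    for line, commit in sorted(line_commits.items()):
        prompt = prompt_by_commit.get(commit)
        if prompt is not None:
            grouped.setdefault(prompt, []).append(line)
    return grouped
```

The payoff of a join like this is exactly what the post describes: the prompt becomes the benchmark task description, and the blamed lines (plus the commit diff) become a real ground-truth solution.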
Jain names three specific ways public offline evals have failed. Most SWE benchmarks focus narrowly on bug-fixing, which doesn't reflect the range of things developers actually use agents for. Automated graders routinely penalize correct solutions that don't match the reference answer. And training data contamination has pushed frontier model scores close enough together that rankings barely distinguish between them anymore.
CursorBench-3, the current version, runs tasks that involve more files and more lines of code than SWE-bench equivalents — multi-workspace monorepos, production log investigations, long-running experiments. Task descriptions are kept short and deliberately underspecified, because that's how developers actually talk to agents. The suite gets rebuilt every few months to track shifting developer behavior.
Correctness results are plotted against median completion tokens, which exposes the speed-accuracy tradeoff in a form that's actually useful for a product people use under time pressure. CursorBench-3 shows meaningfully more separation between frontier models than the public alternatives, which matters if you're trying to make real model selection decisions.
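Each model on such a plot reduces to a single point: fraction of tasks passed on one axis, median completion tokens on the other. A minimal sketch of that summarization — the input shape is an assumption, since the post doesn't specify CursorBench's internal result format:

```python
from statistics import median


def tradeoff_point(results: list[tuple[bool, int]]) -> tuple[float, float]:
    """Collapse per-task results into one (accuracy, median tokens) point.

    results holds (passed, completion_tokens) for each benchmark task.
    Median, not mean, so a few runaway generations don't dominate the
    cost axis — one plausible reason to plot median completion tokens.
    """
    accuracy = sum(passed for passed, _ in results) / len(results)
    med_tokens = median(tokens for _, tokens in results)
    return accuracy, med_tokens
```

Plotting these points for several models makes the tradeoff visible at a glance: a model that is slightly more accurate but emits far more tokens sits in a different region than a fast, nearly-as-accurate one.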
Cursor also runs controlled live traffic experiments alongside the offline suite. The goal is to catch regressions that look fine to a grader but feel worse to a developer mid-session — a failure mode offline evals can't easily detect. The two systems are designed to check each other.
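The post doesn't describe the statistics behind these live experiments, but one standard way to detect a regression in an A/B split is a two-proportion z-test on a behavioral metric, such as how often developers keep the agent's edits. The metric choice and function below are illustrative assumptions, not Cursor's method.

```python
from math import sqrt


def acceptance_z(accept_ctrl: int, n_ctrl: int,
                 accept_new: int, n_new: int) -> float:
    """Two-proportion z statistic comparing edit-acceptance rates.

    Control arm runs the current model; treatment arm runs the candidate.
    A strongly negative z suggests developers accept the candidate's
    edits less often — a regression a grader might never see.
    """
    p_ctrl = accept_ctrl / n_ctrl
    p_new = accept_new / n_new
    pooled = (accept_ctrl + accept_new) / (n_ctrl + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_ctrl + 1 / n_new))
    return (p_new - p_ctrl) / se
```

With identical rates the statistic is zero; the further below roughly -2 it falls, the stronger the evidence of a real mid-session quality drop rather than noise.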
Cursor isn't the only AI developer tools company moving in this direction, but it's one of the few that has published the methodology. The broader shift is practical: when public leaderboards are contaminated and crowded, companies that depend on model quality have to build their own infrastructure to measure it.