XBOW just published their analysis of GPT-5.5, and the numbers are stark. GPT-5 missed 40% of known vulnerabilities in their benchmarks. Opus 4.6 cut that to 18%. GPT-5.5 brings it down to 10%. "Every missed vulnerability is a real liability when you're running automated security testing," says Albert Ziegler, XBOW's Head of AI. "Ten percent still means one in ten slips through, but we're now in a different ballpark."

GPT-5.5 running without source code access already outperforms GPT-5 with full source code. Black-box testing used to feel like "fighting with oven mitts on," as Ziegler puts it. Now it's working barehanded. When they gave GPT-5.5 the source code too, the improvement was so large it "effectively killed" their benchmark: the model just finds nearly everything.

The tedious parts of actual testing got easier too. GPT-5.5 logs into target systems in roughly half the iterations of the next best model. It also fails faster, recognizing blocked access or bad credentials and moving on instead of burning time. XBOW calls this the "persist or pivot" problem. Models trained with RLHF struggle with it because they're optimized to keep users happy, and nobody likes being told to give up. GPT-5.5 still sometimes persists longer than ideal, but only half as often as previous GPT versions or Opus.
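To make the "persist or pivot" trade-off concrete, here's a minimal sketch of what such a heuristic could look like. This is not XBOW's implementation; the failure categories, function name, and retry budget are all hypothetical, chosen only to illustrate the idea of pivoting immediately on terminal failures while allowing a few retries on recoverable ones.

```python
# Hypothetical "persist or pivot" heuristic (illustrative only, not XBOW's code).
# Terminal failures (blocked access, bad credentials) trigger an immediate pivot;
# recoverable ones (timeouts, rate limits) are retried up to a small budget.

TERMINAL = {"blocked", "bad_credentials"}      # pivot immediately on these
RECOVERABLE = {"timeout", "rate_limited"}      # worth another attempt

def persist_or_pivot(failure_history, max_retries=3):
    """Decide whether to keep trying a target given past failure reasons."""
    if any(reason in TERMINAL for reason in failure_history):
        return "pivot"        # a terminal signal means more attempts are wasted
    if len(failure_history) >= max_retries:
        return "pivot"        # retry budget exhausted, move on
    return "persist"

print(persist_or_pivot(["timeout"]))                     # persist
print(persist_or_pivot(["timeout", "bad_credentials"]))  # pivot
print(persist_or_pivot(["timeout"] * 3))                 # pivot
```

The point of the sketch is the asymmetry: an RLHF-tuned model biased toward pleasing the user tends to behave as if everything were recoverable, while a well-calibrated tester treats some failures as a signal to stop immediately.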

XBOW compares GPT-5.5 to Anthropic's Mythos model. The difference? You can actually use this one. Mythos put up similar numbers in limited testing, but good luck getting access. Some community members called the XBOW analysis promotional, noting that smaller open-weight models have shown similar capabilities on narrower benchmarks. Fair pushback. Still, if you're building security tooling that needs to run today, GPT-5.5 is the first model that makes automated pentesting feel less like a gamble.