Claude Opus 4.6 Doubles Its Hallucination Rate Since Launch

Claude Opus 4.6 is making things up at nearly twice the rate it did when it launched. BridgeBench, a code-analysis benchmark run by BridgeMind, retested Anthropic's flagship model on April 12, 2026. Its hallucination rate jumped from 16.7% to 33%.
A model that tied for second place at release now sits at 10th.
The benchmark tests 27 models across 175 questions, measuring how often a model invents false claims when analyzing code. The likely culprit is post-launch optimization. Running frontier models at scale is expensive, so providers routinely apply techniques like model compression or speculative decoding to cut GPU costs and reduce latency. These techniques trade precision for speed and lower cost.
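To make the headline numbers concrete: a hallucination rate in a benchmark like this is just the fraction of answers flagged as containing a fabricated claim. Here is a minimal sketch, assuming a simple per-question boolean grade; the grading pipeline and function names are hypothetical, not BridgeMind's actual code.

```python
# Hypothetical sketch of a hallucination-rate score, not BridgeMind's code.

def hallucination_rate(grades: list[bool]) -> float:
    """Fraction of graded answers flagged as containing a fabricated claim.

    `grades` holds one bool per benchmark question: True if the model's
    answer invented a false claim about the code, False otherwise.
    """
    return sum(grades) / len(grades)

# With 175 questions, a 33% rate means roughly 58 answers contained a
# fabricated claim, up from roughly 29 (16.7%) at launch.
launch = [True] * 29 + [False] * 146   # ~16.7% of 175 questions
retest = [True] * 58 + [False] * 117   # ~33% of 175 questions
print(f"launch: {hallucination_rate(launch):.1%}")  # launch: 16.6%
print(f"retest: {hallucination_rate(retest):.1%}")  # retest: 33.1%
```

In other words, the regression amounts to roughly 29 additional fabricated answers across the same 175-question test set.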
GPT-5.4 held steady at a 16.7% hallucination rate in the same benchmark, suggesting OpenAI either hasn't made similar compromises or managed them better. xAI's Grok 4.20 Reasoning currently leads the board at 10%.
Anthropic hasn't publicly commented on the regression.
For teams relying on Claude Opus 4.6 for code analysis, a model that confidently invents bugs or misidentifies vulnerabilities is actively misleading, not just unreliable. The broader lesson is uncomfortable: a model's performance at launch is a snapshot, not a guarantee. Production infrastructure decisions can quietly erode quality while benchmark numbers from months ago still circulate in marketing materials.