Claude Opus 4.6 is hallucinating at twice the rate it did at launch. BridgeBench's hallucination benchmark shows the model's fabrication rate jumped from 16.7% at release to 33.0% when retested on April 12, 2026. That's a fall from the top ranks to tenth place on the leaderboard, putting it below models like Qwen 3.6 Plus and Gemini 3.1 Pro.
The benchmark, run by BridgeMind, tests how often AI models make false claims when analyzing code. It covers 30 tasks and 175 questions, with answers verified against actual code execution. xAI's Grok 4.20 Reasoning currently tops the board at just 10% hallucination, while Opus 4.6's original launch score is tied with GPT-5.4 at 16.7%.
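BridgeBench's actual harness isn't described beyond "verified against code execution," but the core idea of execution-verified scoring is simple to sketch. Everything below is illustrative, not BridgeMind's API: the model's claim about a piece of code is compared to the ground truth obtained by actually running it.

```python
# Hypothetical sketch of execution-verified hallucination scoring.
# Field names ("run", "model_answer") are made up for illustration.

def hallucination_rate(questions):
    """Fraction of questions where the model's claim contradicts
    the ground truth obtained by actually executing the code."""
    wrong = 0
    for q in questions:
        executed = q["run"]()          # ground truth from real execution
        claimed = q["model_answer"]    # what the model asserted
        if claimed != executed:
            wrong += 1
    return wrong / len(questions)

# Toy example: the model claims sum([1, 2, 3]) == 7 (a fabrication)
# but gets len("abc") == 3 right, so the rate is 0.5.
questions = [
    {"run": lambda: sum([1, 2, 3]), "model_answer": 7},
    {"run": lambda: len("abc"), "model_answer": 3},
]
print(hallucination_rate(questions))  # → 0.5
```

The point of grounding answers in execution rather than human judgment is that a fabricated claim about code behavior is mechanically checkable, which is what makes a hallucination leaderboard like this reproducible.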
So what happened? Hacker News commenters point to quantization as the likely culprit. When demand spikes, companies sometimes reduce model precision, for example from 32-bit floating point down to 8-bit integers, to serve more users on the same hardware. Research shows this can bump error rates on complex reasoning tasks by 15-30%, which lines up with the direction, if not the full size, of Opus 4.6's near-doubling. Anthropic hasn't confirmed any changes to its deployment setup.
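To see why dropping to 8-bit integers can hurt, here is a minimal sketch of symmetric int8 quantization, the kind of precision cut the commenters suspect. This is illustrative only; production serving stacks use far more sophisticated schemes (per-channel scales, calibration, outlier handling).

```python
# Illustrative symmetric int8 quantization round-trip.

def quantize_int8(weights):
    """Map floats into [-127, 127] integers with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.1234, -0.9876, 0.5555, -0.0001]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value can differ from the original by up to scale/2 —
# small per weight, but it compounds across billions of weights and
# dozens of layers, which is where the reasoning degradation comes from.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err <= scale / 2)  # → True
```

Whether Anthropic did anything like this is unconfirmed; the sketch only shows the mechanism the quantization theory relies on.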
If your workflow depends on a specific model's accuracy, periodic retesting isn't optional. The model you tested in January isn't guaranteed to be the one serving your API calls in April.
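If you want to operationalize that retesting, a drift check can be as simple as re-running a small fixed eval set and alerting when accuracy falls past a tolerance. The baseline, tolerance, and scoring callable below are all hypothetical placeholders for your own harness.

```python
# Hypothetical drift check: re-run a small eval and flag regressions.

BASELINE_ACCURACY = 0.833   # e.g., 1 - 0.167 from a January run
TOLERANCE = 0.05            # how much drift you're willing to absorb

def check_drift(score_fn, baseline=BASELINE_ACCURACY, tol=TOLERANCE):
    """Return (current_score, drifted) for a scoring callable that
    re-runs your eval set and reports accuracy in [0, 1]."""
    current = score_fn()
    return current, (baseline - current) > tol

# Simulate an April retest coming back at 67% accuracy:
score, drifted = check_drift(lambda: 0.670)
print(drifted)  # → True: time to pin a version, re-evaluate, or switch
```

Running something like this on a schedule (cron, CI, whatever you already have) turns "the model silently changed" from a surprise into a ticket.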