Alibaba's DAMO Academy released Qwen3.6-27B on Hugging Face this week, and the benchmark numbers turned heads. 94.1 on MathArena AIME 2026. 87.8 on GPQA. Internal benchmarks reportedly show it beating Anthropic's Claude Opus 4.5. Bold claim for an open-weight model at 27 billion parameters.
The community noticed something off. Hacker News commenters flagged that the Terminal-Bench 2.0 scores used non-standard parameters: a 3-hour timeout and 32 CPU cores with 48 GB of RAM. The benchmark's rules explicitly disallow modifying timeouts or resource allocations; the standard limits exist to test whether a model can finish tasks under constrained time and hardware. That matters. It's the difference between a fair fight and a stacked deck.
The model handles image-text-to-text tasks, tool and function calling, and reasoning. Unsloth already shipped a 4-bit quantized GGUF version, so you can run it without enterprise hardware. Vision, tool use, and reasoning in something that fits on a single GPU.
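Back-of-the-envelope arithmetic shows why 4-bit quantization is what makes single-GPU inference plausible at 27B parameters. The sketch below is a rough weights-only estimate; real GGUF files use mixed quant types and add metadata, and the KV cache consumes additional memory on top:

```python
# Approximate memory needed just for the weights of a 27B-parameter
# model at different precisions. Ballpark figures, not exact file sizes.
PARAMS = 27e9  # 27 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes of weight storage at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"fp16 : {weight_gb(16):.1f} GB")  # ~54 GB: out of reach for one consumer GPU
print(f"8-bit: {weight_gb(8):.1f} GB")   # ~27 GB
print(f"4-bit: {weight_gb(4):.1f} GB")   # ~13.5 GB: fits a 16-24 GB card
```

At 4-bit, the weights alone drop to roughly 13.5 GB, which is why the quantized release runs on a single high-end consumer GPU where the fp16 version would not.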
An open-weight model competing with frontier models like GLM-5.1 is a big deal. Whether those benchmark claims hold up under standard conditions remains an open question.