Grok 4.20 costs 173 times as much per correct answer as its predecessor. The benchmarks don't back it up.

xAI shipped Grok 4.20 Beta on Thursday, and the benchmark numbers tell an uncomfortable story. According to AI Benchy, which tests models across multiple languages and task types, the new release scores a 7.0 average and lands at #24 overall. Its predecessor, Grok 4.1 Fast, sits at #32 with a 6.2. That is eight positions and 0.8 points of separation, which might sound like progress until you look at what it costs to get there.

Grok 4.20 Beta runs at roughly $0.97 per correct answer. Grok 4.1 Fast costs $0.0056 per correct answer. That's a roughly 173x increase in cost for a model that moved eight spots up a leaderboard where Google's Gemini 3 Flash Preview sits untouched at #1 with a perfect 10.0 and a 100% test pass rate.
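For readers who want to check the arithmetic, here is a minimal sketch that reproduces the ratio from the figures quoted above. The dictionary layout and function names are illustrative assumptions, not anything AI Benchy publishes; only the four numbers come from the leaderboard as reported here.

```python
# Figures as quoted in this article from the AI Benchy leaderboard.
# The data layout and helper names are hypothetical, for illustration only.
COST_PER_CORRECT_USD = {
    "Grok 4.20 Beta": 0.97,
    "Grok 4.1 Fast": 0.0056,
}
AVERAGE_SCORE = {
    "Grok 4.20 Beta": 7.0,
    "Grok 4.1 Fast": 6.2,
}


def cost_ratio(new: str, old: str) -> float:
    """How many times more the new model costs per correct answer."""
    return COST_PER_CORRECT_USD[new] / COST_PER_CORRECT_USD[old]


def score_ratio(new: str, old: str) -> float:
    """How many times higher the new model's average score is."""
    return AVERAGE_SCORE[new] / AVERAGE_SCORE[old]


if __name__ == "__main__":
    new, old = "Grok 4.20 Beta", "Grok 4.1 Fast"
    print(f"Cost per correct answer: {cost_ratio(new, old):.0f}x")
    print(f"Average score: {score_ratio(new, old):.2f}x")
```

Run as a script, this puts the two ratios side by side: the cost multiple rounds to 173, while the average score improves by about 13 percent.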

The multi-agent variant makes things worse. Grok 4.20 Multi-Agent Beta ranks #47 with a 4.9 average, lower than both single-model Grok releases and below dozens of cheaper alternatives on the same leaderboard. Multi-agent framing has become something of a default assumption in the industry: more orchestration, more capability. The numbers for xAI's release push back on that. Whatever architectural decisions went into the multi-agent build, they didn't help on AI Benchy's test suite.

The broader leaderboard context is worth sitting with. Behind Gemini 3 Flash Preview sit Gemini 3.1 Pro Preview and ByteDance's Seed-2.0-Lite, followed by GPT-5.3-Codex and Alibaba's Qwen3.5 Plus to round out the top five. That's Google, OpenAI, and two Chinese labs holding the upper tier while xAI's flagship new release lands in the mid-twenties. For a company that has spent the better part of two years marketing Grok as a frontier competitor, that gap is hard to paper over.

None of this makes Grok 4.20 a bad model in absolute terms — a 7.0 average across a multilingual evaluation isn't embarrassing. But for developers running cost-performance calculations on production deployments, the math is difficult to justify. The question isn't whether Grok 4.20 outperforms Grok 4.1 Fast. It does, narrowly. The question is whether it's 173 times better. It isn't.