A blogger at Dynomight ran an informal but revealing benchmark in March 2026: prompt eight frontier LLMs to derive closed-form equations predicting how quickly boiling water cools in a ceramic mug under specified ambient conditions. Six returned usable answers. Two — DeepSeek and Grok — billed for compute and delivered nothing.

The task was chosen for its deliberate ambiguity. Unlike math olympiad problems or coding challenges, it has no derivable ground-truth answer. It requires what the author calls "taste" — identifying the dominant physical phenomena, committing to a plausible quantitative model, and making calibrated assumptions about underspecified parameters. Six models cleared that bar: Claude 4.6 Opus in reasoning mode, GPT 5.4, Gemini 3.1 Pro, Kimi K2.5, Qwen3-235B, and GLM-4.7. All six independently converged on exponential-decay forms — physically sound, consistent with Newton's Law of Cooling — with several spontaneously identifying two distinct decay timescales reflecting the separate thermal dynamics of the water and the mug.
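For reference (this is standard physics, not material from the benchmark itself): Newton's Law of Cooling posits a cooling rate proportional to the gap between the object and ambient temperature, whose solution is a single exponential decay; the two-timescale behavior several models identified corresponds to a sum of two such terms with separate amplitudes and time constants:

```latex
\frac{dT}{dt} = -k\,(T - T_{\mathrm{amb}})
\;\;\Longrightarrow\;\;
T(t) = T_{\mathrm{amb}} + (T_0 - T_{\mathrm{amb}})\,e^{-kt},
\qquad
T_{\text{two-mode}}(t) = T_{\mathrm{amb}} + A\,e^{-t/\tau_1} + B\,e^{-t/\tau_2}
```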

Claude 4.6 Opus in reasoning mode produced the best-fitting result, the equation 20 + 55·exp(-t/1700) + 25·exp(-t/43), at a cost of $0.61 per query. Moonshot AI's Kimi K2.5 came in at $0.01 per query and still correctly captured both the fast heat transfer from water into the mug and the slower dissipation to air. When the author ran the physical experiment under 20°C ambient conditions, every model underestimated early-phase cooling — likely driven by evaporative and convective effects that don't emerge from a text prompt — and overestimated long-term heat loss. None was highly accurate, but all were in the right ballpark.
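The winning equation is simple enough to inspect directly. A minimal sketch (the function name is ours, and the benchmark doesn't specify units for t; the reading of the two terms follows the two-timescale interpretation above):

```python
import math

def temp(t):
    """Claude 4.6 Opus's fitted cooling curve: a 20-degree ambient
    baseline, a slow mode (timescale 1700) for water-to-air loss, and
    a fast mode (timescale 43) for the initial water-to-mug transfer."""
    return 20 + 55 * math.exp(-t / 1700) + 25 * math.exp(-t / 43)

# At t = 0 the amplitudes sum to 20 + 55 + 25 = 100, boiling water;
# as t grows, both exponentials decay and the curve settles toward
# the 20-degree ambient term.
```

The fast term is all but gone after a few hundred time units, which is why the early part of the curve drops so much more steeply than the tail.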

DeepSeek and Grok never got there. Both entered what the author describes as endless "flailing" without ever committing to an answer, while the meter kept running. This isn't a quality gap — it's complete non-delivery on a task requiring undergraduate physics and basic exponential modeling. Models fine-tuned to avoid committing to unverifiable claims may be well-calibrated for factual retrieval. On estimation tasks, that same caution is counterproductive: a principled approximation under irreducible uncertainty is exactly the right output, and a model that can't produce one doesn't fail gracefully — it burns budget and blocks whatever comes next.

That failure mode matters more than the cost gap, though the cost gap is real: $0.01 versus $0.61 for a task where the cheaper model still produced a physically plausible result. For teams running agentic pipelines over problems that are underdetermined by nature — lab automation, robotics planning — the ability to commit to a useful approximation is <a href="/news/2026-03-14-against-vibes-evaluating-generative-model-utility">the criterion that matters</a>. Existing leaderboards don't test for it. The Dynomight coffee experiment, however informal, does.