A developer named Abid Sikder published an experiment on March 14, 2026, pitting four audio-capable AI models against 65 popular songs spanning trance, hip-hop, classic rock, and pop. Using OpenRouter as the access layer, Sikder prompted Google's Gemini 3.1 Flash Lite and Gemini 3.1 Pro, Mistral AI's Voxtral Small 24B, and OpenRouter's experimental Healer Alpha to each write a short review and assign a 0-10 rating to every track. The experiment's scope was limited by a practical constraint: as of early 2026, only a handful of models support both audio input and structured JSON output simultaneously — a gap Sikder pointedly attributes to OpenAI, which he notes still lacks structured output support for its audio-capable models.
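The request shape such a setup implies can be sketched as follows. This is a hypothetical reconstruction, not Sikder's actual code: it assumes OpenRouter's OpenAI-compatible chat completions endpoint, a model slug, a prompt wording, and a rating schema that the article does not specify.

```python
import base64
import json

# Hypothetical JSON schema for the structured output: a short review plus a
# 0-10 integer rating. The field names are assumptions for illustration.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "review": {"type": "string"},
        "rating": {"type": "integer", "minimum": 0, "maximum": 10},
    },
    "required": ["review", "rating"],
    "additionalProperties": False,
}

def build_payload(model: str, audio_b64: str, audio_format: str = "mp3") -> dict:
    """Build a chat-completions payload pairing audio input with a JSON schema."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Write a short review of this track and rate it 0-10."},
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": audio_format}},
                ],
            }
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "track_review", "strict": True,
                            "schema": REVIEW_SCHEMA},
        },
    }

# Example: build the payload for one track (dummy bytes stand in for real audio;
# the model slug is an assumed OpenRouter-style identifier).
payload = build_payload("google/gemini-3.1-pro",
                        base64.b64encode(b"...audio bytes...").decode())
print(json.dumps(payload)[:60])
```

Requiring both the `input_audio` content part and the `json_schema` response format in one request is exactly the intersection the article says few models satisfy.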

The ratings reveal just how differently these models approach the same audio. Gemini 3.1 Pro showed the widest range and the sharpest critical instincts, handing Rick Astley's "Never Gonna Give You Up" and Rebecca Black's "Friday" both a 1, while awarding The Clash's "London Calling" a perfect 10. Voxtral Small 24B, by contrast, rarely dipped below 7 and handed Rick Astley a perfect 10 of its own. Most strikingly, it rated a recording of nails on a chalkboard an 8.

That control clip is the sharpest finding in the piece. Sikder inserted it deliberately as a baseline for unpleasant audio, and every other model rated it 1 or 2. Voxtral's 8 points to either a systematic upward bias or a genuinely broken sense of what constitutes a bad sound, and neither explanation reflects well on its reliability as an audio evaluator.

Healer Alpha, OpenRouter's own entry, produced errors and NA results for a significant share of songs, making direct comparison difficult. Between a model that can't tell nails on a chalkboard from a decent track and one that couldn't finish the test, the experiment ends on a note more honest than most AI benchmarks manage.