SOB Benchmark: 95% Valid JSON, 70% Correct Values

Interfaze just released a benchmark called SOB that exposes a real problem with LLM structured outputs. Most models pass JSON validation 95% of the time or more. But when you check whether the values inside that JSON are actually correct, accuracy drops by 15 to 30 percentage points. That matters for anyone building pipelines that parse invoices, medical records, or meeting transcripts into structured data: a hallucinated invoice total or misordered dates can silently break downstream systems.

SOB tests 21 models across text, image, and audio inputs using seven metrics. Three of them, Value Accuracy, Faithfulness, and Perfect Response, are where models actually differentiate. The four structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling for everyone. The benchmark normalizes image and audio inputs to text before scoring, which isolates extraction ability from vision or speech recognition. Harder schemas carry more weight in the final ranking, so models can't coast on easy outputs.

The top performers, GPT-5.4, GLM-4.7, and Qwen3.5-35B, sit within a point of each other on overall score. Even the best model, though, achieves a Perfect Response rate of only around 47%. More than half the time, something in the output is wrong. Schematron-8B passes JSON validation 98.7% of the time but lands at just 73.1% on Value Accuracy, a 25.6-point gap. That's the space where existing benchmarks have been giving models a free pass.

Hacker News commenters flagged some gaps. Opus 4.7 and Gemini 3.1 Pro are absent from the leaderboard, and users questioned the selection criteria. Some argued that since the top models score so similarly, the differentiation SOB offers has limited practical value. Fair criticism. But the gap between valid JSON and accurate values alone makes this worth attention. If your production system treats valid JSON as correct JSON, you're flying blind.
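
To make the difference between "valid JSON" and "correct values" concrete, here is a minimal Python sketch that scores one response both ways. It is not SOB's actual scoring code: the `flatten` and `score` helpers, the exact-match comparison rule, and the invoice data are all assumptions made up for illustration.

```python
import json
from typing import Any, Dict, Tuple


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Map every leaf of a nested JSON value to a path such as 'items[0].qty'."""
    leaves: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            leaves.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}[{index}]"))
    else:
        leaves[prefix] = value
    return leaves


def score(raw_output: str, expected: dict) -> Tuple[bool, float]:
    """Return (json_pass, value_accuracy) for a single model response.

    Hypothetical metric definitions, not SOB's: json_pass means the text
    parses; value_accuracy is the share of expected leaves that appear in
    the output at the same path with an equal value.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, 0.0

    expected_leaves = flatten(expected)
    actual_leaves = flatten(parsed)
    if not expected_leaves:
        return True, 1.0

    correct = sum(
        1
        for path, want in expected_leaves.items()
        if path in actual_leaves and actual_leaves[path] == want
    )
    return True, correct / len(expected_leaves)


# Made-up ground truth for an invoice extraction task.
expected = {
    "vendor": "Acme Corp",
    "total": 1240.50,
    "due_date": "2025-03-01",
    "line_items": [{"sku": "A-1", "qty": 2}],
}

# Perfectly valid JSON, but the total has two digits transposed.
raw = (
    '{"vendor": "Acme Corp", "total": 1420.50, "due_date": "2025-03-01", '
    '"line_items": [{"sku": "A-1", "qty": 2}]}'
)

print(score(raw, expected))  # (True, 0.8)
```

The response parses cleanly, so a validation-only check would count it as a success, yet one of its five leaf values is wrong. A pipeline that logs only the first number in that tuple would never see the bad total.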