Standard Agents launched VibeBench, a project that asks 1,000 engineers to evaluate new AI models based on actual hands-on experience rather than benchmark scores. The premise is simple: published benchmarks get overfit until they stop telling you anything useful about real performance. A few traditional benchmarks still correlate with real-world capability, but they are the exception. Most have become optimization targets first and quality measures second, and the gap between a model's scores and its actual utility widens as models are tuned for specific tests rather than general use.

Community reaction has been mostly positive; people are glad someone is trying to cut through benchmark inflation. What pushback there has been misses the point. The interesting bet here is structural. Rather than building a better automated test, Standard Agents is gambling that a large enough pool of human opinion will smooth out individual biases and produce a more honest signal than any leaderboard. One thousand engineers won't agree on everything, but aggregate opinion is harder to game than a fixed test suite. Given how quickly the project is iterating, we should find out soon whether the bet pays off.
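The "bigger pool smooths out bias" claim is essentially the law of large numbers. As a purely illustrative sketch (this does not reflect VibeBench's actual methodology, which hasn't been published; the rating scale, bias, and noise levels here are made up), the toy simulation below gives each hypothetical rater a true score plus a personal bias and per-rating noise, then shows how the spread of the pooled average shrinks as the pool grows:

```python
import random
import statistics

def pooled_spread(true_quality: float, n_raters: int, trials: int = 1000) -> float:
    """Std. dev. of the pooled average rating across repeated trials.

    Each hypothetical rater reports the true quality plus a personal bias
    and per-rating noise; averaging a larger pool cancels more of that error.
    """
    pooled_means = []
    for _ in range(trials):
        ratings = [
            true_quality
            + random.gauss(0, 1.0)   # individual bias (some raters skew high or low)
            + random.gauss(0, 0.5)   # per-rating noise (mood, task mix, luck)
            for _ in range(n_raters)
        ]
        pooled_means.append(statistics.fmean(ratings))
    return statistics.stdev(pooled_means)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        # Spread shrinks roughly with 1 / sqrt(n): ~1.1 points for one rater,
        # ~0.04 points for a thousand.
        print(f"{n:>4} raters -> pooled score varies by about ±{pooled_spread(7.0, n):.2f}")
```

The same averaging is what makes the signal harder to game: shifting the pooled score means shifting many independent opinions, not one fixed test.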