Contra Labs just dropped a benchmark that actually tries to measure something useful about AI-generated creative work, similar to how **VibeBench** judges AI models. The Human Creativity Benchmark splits creative evaluation into two signals: convergence, where professional creatives agree on basics like readable typography and functional layout, and divergence, where they legitimately disagree because taste isn't objective.

The study pulled from Contra's network of over 1.5 million independent professional creatives who have collectively earned more than $250M on the platform. These aren't random crowdworkers. They're working designers, illustrators, and digital artists whose aesthetic judgments are commercially validated. Evaluators assessed AI outputs across five domains: landing pages, desktop apps, ad images, brand assets, and product videos. They used pairwise comparisons, scalar ratings, and open-ended feedback across three creative phases: ideation, mockup, and refinement.
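The article doesn't say how those pairwise votes get rolled up into rankings, but a standard way to do it is a Bradley-Terry fit over head-to-head preferences. Here's a minimal sketch under that assumption; the model names and vote counts are hypothetical.

```python
# Minimal sketch: aggregating pairwise "which output is better?" votes into
# per-model strengths with a Bradley-Terry fit (fixed-point MM iteration).
# The benchmark's actual aggregation isn't specified in the article;
# the model names and vote counts here are hypothetical.

# wins[(a, b)] = times evaluators preferred model a's output over model b's
wins = {
    ("model_a", "model_b"): 14, ("model_b", "model_a"): 6,
    ("model_a", "model_c"): 11, ("model_c", "model_a"): 9,
    ("model_b", "model_c"): 8,  ("model_c", "model_b"): 12,
}

models = sorted({m for pair in wins for m in pair})
strength = {m: 1.0 for m in models}  # initial Bradley-Terry strengths

for _ in range(200):  # fixed-point updates converge quickly for small tables
    new = {}
    for m in models:
        total_wins = sum(w for (a, _), w in wins.items() if a == m)
        denom = sum(
            (wins.get((m, o), 0) + wins.get((o, m), 0)) / (strength[m] + strength[o])
            for o in models if o != m
        )
        new[m] = total_wins / denom if denom else strength[m]
    scale = len(models) / sum(new.values())  # normalize to keep scores stable
    strength = {m: s * scale for m, s in new.items()}

for m, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {s:.2f}")
```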

The core finding is blunt: no current model is reliably both correct and steerable. Models can follow instructions, and they can produce visually appealing work, but consistently nailing both in the same output remains unsolved.

The benchmark found that prompt adherence and usability produce higher agreement among evaluators. Visual appeal sits at the other end, where taste drives legitimate disagreement that shouldn't be smoothed away.
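That split can be made concrete with a per-criterion agreement statistic. The article doesn't say which statistic Contra Labs uses, so the sketch below just computes a crude pairwise agreement rate over hypothetical 1-5 ratings to show the pattern: adherence and usability scores cluster, visual appeal scatters.

```python
# Minimal sketch of the convergence/divergence split: a crude inter-rater
# agreement rate per criterion. The ratings are made up, and the benchmark's
# real agreement statistic isn't specified in the article; this only shows
# the shape of the comparison.
from itertools import combinations

# One 1-5 rating per evaluator for the same output (hypothetical values).
ratings = {
    "prompt_adherence": [4, 4, 5, 4, 4, 5],   # evaluators largely converge
    "usability":        [3, 3, 3, 4, 3, 3],
    "visual_appeal":    [2, 5, 3, 5, 1, 4],   # taste-driven divergence
}

def pairwise_agreement(scores, tolerance=1):
    """Fraction of evaluator pairs whose ratings fall within `tolerance`."""
    pairs = list(combinations(scores, 2))
    agree = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agree / len(pairs)

for criterion, scores in ratings.items():
    print(f"{criterion}: agreement {pairwise_agreement(scores):.2f}")
```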

This matters because of what the researchers call mode collapse. Give multiple models the same creative brief and they converge on safe, averaged aesthetics instead of distinctive directions. Creative professionals need differentiated output for trend awareness and rapid exploration of distinct visual directions. A model that defaults to a single safe option fails that workflow even when the output looks technically competent.
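The article doesn't describe how mode collapse was detected, but one common way to quantify it is to embed each model's output for the same brief and check how spread out those embeddings are. The sketch below uses random vectors as stand-ins for real image or text embeddings; a low mean pairwise cosine distance suggests the models converged on the same safe look.

```python
# Minimal sketch of one way to quantify mode collapse: embed each model's
# output for the same brief and look at mean pairwise cosine distance.
# The article doesn't say how Contra Labs measures this; the vectors below
# are random stand-ins for real embeddings.
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Average (1 - cosine similarity) over all pairs of output embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    iu = np.triu_indices(len(embeddings), k=1)  # each unordered pair once
    return float(np.mean(1.0 - sims[iu]))

# Hypothetical: five models' outputs clustered tightly around one "safe" look...
collapsed = rng.normal(scale=0.05, size=(5, 128)) + rng.normal(size=128)
# ...versus five outputs exploring genuinely distinct directions.
diverse = rng.normal(size=(5, 128))

print("collapsed set:", round(mean_pairwise_cosine_distance(collapsed), 3))
print("diverse set:  ", round(mean_pairwise_cosine_distance(diverse), 3))
```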

Contra Labs is the research division of Contra, a commission-free freelance platform, which explains how they accessed such a large pool of verified professionals for the evaluation.