Someone on HuggingFace has built a leaderboard for AI leaderboards. It sounds like a punchline, but MAYA-AI/all-leaderboard is a real project — and a reasonably useful one.

The setup comes from a user called mayafree, operating under the MAYA-AI organization. The idea is to skip the expert committees and just track what the research community actually engages with. Two signals determine rankings: HuggingFace's live trending scores and cumulative likes. No editorial curation, no institutional backing. If researchers are using a benchmark and upvoting it, it floats up.
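The two-signal mechanism is simple enough to sketch. Below is a minimal, hypothetical illustration of ranking by either signal; the field names, numbers, and weighting are assumptions for demonstration, not MAYA-AI's actual code or data.

```python
# Hypothetical sketch of a two-signal leaderboard ranking.
# Field names and example values are invented for illustration.
from dataclasses import dataclass

@dataclass
class Board:
    name: str
    trending: float  # live trending score (assumed field)
    likes: int       # cumulative likes (assumed field)

def rank(boards: list[Board], by: str = "trending") -> list[Board]:
    """Sort leaderboards by a single community signal, highest first."""
    key = (lambda b: b.trending) if by == "trending" else (lambda b: b.likes)
    return sorted(boards, key=key, reverse=True)

boards = [
    Board("Open LLM Leaderboard", trending=12.0, likes=5000),
    Board("FINAL Bench", trending=48.0, likes=900),
    Board("MTEB", trending=7.5, likes=3200),
]

# A board with a recent spike tops the trending view even with fewer likes.
print([b.name for b in rank(boards, by="trending")])
print([b.name for b in rank(boards, by="likes")])
```

The point of keeping the two signals separate rather than blending them into one score is visible even in this toy version: a newer benchmark can lead on trending while an established one leads on likes.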

The collection covers the established names — Open LLM Leaderboard, Chatbot Arena, MTEB, BigCodeBench — alongside a number of newer entrants. FINAL Bench is the most ambitious of the newcomers, claiming AGI-level evaluation across 100 tasks in 15 domains; it recently hit the global top 5 in HuggingFace dataset rankings. Smol AI WorldCup runs tournament-format competitions specifically for models under 8 billion parameters, which reflects where a lot of practical development attention has shifted. ALL Bench takes a different approach, aggregating results across multiple frameworks into one ranking — explicitly to make it harder for developers to overfit to any single standard.

The interface is straightforward. Sort by trending to see what's gaining traction now, or by likes to see what has held up over time. Nine domain filters let you narrow things down to whatever evaluation area you actually care about. Every listed leaderboard shows its rank within the collection and its real-time position in the broader HuggingFace Spaces ecosystem.

None of the underlying infrastructure is novel. But the problem it responds to is real and persistent: there are too many benchmarks, their results are inconsistent, and developers routinely optimize for them in ways that hollow them out. Mayafree frames it as a question about measurement, arguing that how we evaluate AI is as consequential as the AI itself, which is a fairly well-worn observation at this point. What's different here is the approach: making benchmark credibility an automated signal derived from aggregate community engagement, rather than something gatekept by institutions or settled by whichever researchers are loudest on social media. Whether it becomes a genuine reference point or a curiosity depends on adoption. The infrastructure is there.