Jonathan Lee of Weird Gloop, which operates some of the largest video game wikis on the internet, including those for Minecraft, Old School RuneScape, and League of Legends, published a detailed technical account on March 13, 2026 describing how AI scraper bots have become an existential infrastructure challenge for the wiki ecosystem. Without continuous active mitigation, Lee writes, bots would consume roughly 10 times more compute resources than all legitimate human traffic combined, even though that human traffic includes tens of millions of daily pageviews. Lee estimates that approximately 95% of all server issues across the wiki ecosystem in recent months have been caused by aggressive scrapers, with the Wikimedia Foundation publicly acknowledging operational impacts and some smaller independent wikis being knocked completely offline.

The problem has evolved well beyond "official" crawlers like OpenAI's GPTBot, Anthropic's ClaudeBot, and Perplexity's PerplexityBot, which at least self-identify in their User-Agent strings and can be blocked with standard tools such as Cloudflare rules or nginx. The more damaging traffic masquerades as legitimate Google Chrome browser sessions, routing requests through residential proxy networks that cycle through millions of IP addresses belonging to ordinary ISPs such as Comcast, AT&T, and Charter. Bad actors have also exploited trusted services, using Google Translate's URL tool and Facebook's facebookexternalhit link-preview crawler to launder scraper requests through Google and Meta IP ranges, at times forcing Weird Gloop to disable Google Translate functionality entirely for its wikis. The scrapers themselves use naive breadth-first crawling strategies that ignore robots.txt and sitemaps, landing on billions of low-value MediaWiki URLs (old revision histories, edit screens, special pages) that bypass caching and are 50 to 100 times more expensive to serve than standard article content.
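Much of the cost asymmetry comes down to which URLs a crawler requests. A rough way to see the distinction is a filter like the following Python sketch, which classifies request URLs as cheap cached article views or expensive dynamic pages; the parameter list and path prefix here are illustrative assumptions, not Weird Gloop's actual rules:

```python
from urllib.parse import urlparse, parse_qs

# Illustrative markers of uncacheable, expensive MediaWiki pages:
# old revisions, diffs, edit/history screens, and special pages.
EXPENSIVE_PARAMS = {"oldid", "diff", "action", "veaction"}
SPECIAL_PREFIX = "/wiki/Special:"

def is_expensive(url: str) -> bool:
    """Return True for URLs that bypass the page cache and hit the parser/DB."""
    parsed = urlparse(url)
    if parsed.path.startswith(SPECIAL_PREFIX):
        return True
    return any(p in parse_qs(parsed.query) for p in EXPENSIVE_PARAMS)

print(is_expensive("https://example.wiki/wiki/Main_Page"))                     # False
print(is_expensive("https://example.wiki/w/index.php?title=Foo&oldid=12345"))  # True
```

A breadth-first crawler that follows every "View history" and "Permanent link" lands almost entirely in the expensive bucket, which is why bot traffic can cost an order of magnitude more to serve than human traffic of the same volume.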

Mitigation options remain a patchwork of imperfect tools. Cloudflare browser challenges and services like Anubis have become widespread, though Lee notes that determined bots pass these checks roughly 10% of the time — enough to cause serious damage during high-volume spikes that can hit 1,000 or more requests per second and are nearly indistinguishable from traditional DDoS attacks. More sophisticated techniques include JA4 TLS fingerprinting, which can detect non-browser clients even when HTTP headers are spoofed, and behavioral heuristics that flag traffic missing the secondary requests for fonts, CSS, and analytics that real browsers generate. Others have explored <a href="/news/2026-03-14-optimizing-web-content-for-ai-agents-via-http-content-negotiation">proactive approaches like serving optimized content to AI agents</a> via HTTP content negotiation. A community-level solution analogous to DNS-based email spam blocklists has been floated in operator discussions, though no standardized system exists yet.
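The behavioral heuristic is straightforward to sketch: a real browser that renders a page also fetches its fonts, CSS, and analytics beacons, while a naive scraper requests only the HTML. A minimal illustration in Python, with hypothetical log fields, asset extensions, and thresholds rather than any production rule set:

```python
from collections import defaultdict

# Illustrative set of secondary-asset extensions a rendering browser fetches.
ASSET_HINTS = (".css", ".js", ".woff2")

def flag_headless_clients(log, min_html_requests=5):
    """Flag client IPs with many page requests but zero asset fetches.

    `log` is an iterable of (client_ip, request_path) pairs.
    """
    html_counts, asset_counts = defaultdict(int), defaultdict(int)
    for ip, path in log:
        if path.endswith(ASSET_HINTS):
            asset_counts[ip] += 1
        else:
            html_counts[ip] += 1
    return {ip for ip, n in html_counts.items()
            if n >= min_html_requests and asset_counts[ip] == 0}

log = [("1.2.3.4", f"/wiki/Page{i}") for i in range(20)]            # scraper-like
log += [("5.6.7.8", "/wiki/Main_Page"), ("5.6.7.8", "/static/site.css")]
print(flag_headless_clients(log))  # {'1.2.3.4'}
```

In practice such a check would run over short time windows and combine with TLS fingerprints and challenge results, since a single signal is easy for a determined scraper to fake once it is known.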

Lee is careful to warn against the most aggressive countermeasures. Fandom's decision to require logins as a bot-defense measure resulted in roughly a 40% drop in new contributor activity, a concrete illustration of the trade-off: the open, low-friction model that lets wikis function as public knowledge resources is exactly what makes them vulnerable. The post frames the current scraping landscape as a collective action problem: the open-web content that LLM training pipelines depend on is being destabilized by the very industry consuming it, while the operators bearing the infrastructure costs have little leverage over the actors responsible.