AI coding agents are supposed to read documentation and help you write code. But what if they can't actually see half the content on the page? That's the problem Agent Reading Test exposes. Created by Dachary Carey, this benchmark runs agents through 10 documentation tasks designed to trigger specific failure modes: content truncation, CSS noise burying the real text, client-side rendering delivering empty shells, tabbed content collapsing into unreadable walls. The clever part is how it scores them. Canary tokens get embedded at strategic positions in test pages. Agents complete realistic tasks without knowing they're being tested. Then they report which tokens they encountered, revealing exactly what their web fetch pipelines delivered.
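The token-based scoring idea can be illustrated with a small sketch. This is not the benchmark's actual code; the token names, offsets, and truncation limit below are all hypothetical, chosen just to show how a token buried deep in a page reveals a pipeline's truncation point.

```python
# Hypothetical sketch of canary-token scoring. Token names, offsets, and
# the 100 KB truncation limit are illustrative, not the benchmark's values.

TOKENS = {
    "CANARY-HEAD": 0,        # near the top of the page
    "CANARY-MID": 50_000,    # mid-page, past smaller fetch limits
    "CANARY-TAIL": 200_000,  # deep in the page
}

def build_page(tokens):
    """Pad a page with filler text and drop each token at its byte offset."""
    body = ["x"] * (max(tokens.values()) + 1_000)
    for token, offset in tokens.items():
        body[offset:offset + len(token)] = token
    return "<html><body>" + "".join(body) + "</body></html>"

def score_report(agent_report, tokens):
    """Count how many embedded tokens the agent says it encountered."""
    seen = {t for t in tokens if t in agent_report}
    return len(seen), sorted(tokens.keys() - seen)

page = build_page(TOKENS)

# Simulate a fetch pipeline that silently truncates the page at 100 KB.
truncated_fetch = page[:100_000]
report = " ".join(t for t in TOKENS if t in truncated_fetch)

score, missing = score_report(report, TOKENS)
print(score, missing)  # → 2 ['CANARY-TAIL']
```

The agent never mentions the tail token, so the scorer learns the pipeline cut the page off somewhere between the mid and tail offsets, without ever priming the agent to look for anything.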
Current agents score 14-18 out of 20 points. Some high scores mask underlying problems. Carey found that Claude Code manually works around cross-host redirects that its fetch pipeline doesn't natively follow. That's technical debt dressed up as competence. The benchmark catches this because it doesn't prime agents to hunt for specific content. You just point your agent at agentreadingtest.com/start/, let it do its thing, and see what it actually read.
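The cross-host redirect behavior is easy to picture as a policy check. The function and URLs below are hypothetical, a sketch of the conservative rule some fetch pipelines apply: follow redirects within the same host, refuse ones that hop to a different one, leaving the agent to re-fetch manually.

```python
from urllib.parse import urlparse

# Hypothetical policy sketch: some fetch pipelines follow same-host
# redirects but refuse cross-host ones. Function name and URLs are
# illustrative, not taken from any real agent's implementation.

def follows_redirect(original_url, redirect_target, allow_cross_host=False):
    """Decide whether a conservative fetcher would follow this redirect."""
    src = urlparse(original_url).hostname
    dst = urlparse(redirect_target).hostname
    return src == dst or allow_cross_host

# Same-host redirect: followed transparently.
print(follows_redirect("https://docs.example.com/a",
                       "https://docs.example.com/b"))  # → True

# Cross-host redirect: dropped by the pipeline; an agent may then
# notice the failure and fetch the new host itself, which looks like
# competence but is really a manual workaround.
print(follows_redirect("https://docs.example.com/a",
                       "https://cdn.example.net/b"))   # → False
```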
Documentation sites and agent platforms have been operating in the dark about each other. The Agent-Friendly Documentation Spec, also by Carey, defines 22 checks for making docs more agent-readable. Agent Reading Test flips the perspective and tests the agents instead. It's built with static HTML and CSS, which is ironic given it tests whether agents can detect JS-rendered content. But that's the point. If your agent needs JavaScript to see basic content, that's a pipeline problem, not a documentation problem.
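The client-side rendering failure mode is worth making concrete. The sketch below is illustrative, not the benchmark's code: a single-page-app shell ships an empty root element plus a script bundle, so an HTML-only fetcher that never executes JavaScript extracts no text at all.

```python
from html.parser import HTMLParser

# Illustrative sketch: a client-side-rendered page whose real content
# exists only inside a <script>. The page markup here is hypothetical.
SPA_SHELL = """
<html><body>
  <div id="root"></div>
  <script>
    document.getElementById('root').textContent = 'Run: pip install mytool';
  </script>
</body></html>
"""

class VisibleText(HTMLParser):
    """Collect the text an HTML-only fetcher sees, skipping <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.in_hidden = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_hidden:
            self.in_hidden -= 1

    def handle_data(self, data):
        if not self.in_hidden and data.strip():
            self.chunks.append(data.strip())

parser = VisibleText()
parser.feed(SPA_SHELL)
print(parser.chunks)  # → [] -- the install command never reaches a no-JS fetcher
```

An agent scored against a page like this reports none of its tokens, which is exactly the signal the benchmark wants: the pipeline delivered an empty shell, regardless of how capable the model behind it is.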
The benchmark gives platform engineers something concrete to optimize against. Instead of vague "browsing capabilities," you get specific measurements: where does truncation kick in? Does CSS burial work? Can the agent follow cross-host redirects? These are answerable questions now. And for teams choosing between Claude Code, Cursor, or GitHub Copilot, there's finally a standardized way to compare how they actually consume the web.