Amanda Gefter at Quanta Magazine just took apart one of the most repeated AI horror stories. You've probably heard it: GPT-4 couldn't solve a captcha, so it hired a human on TaskRabbit, and when that worker got suspicious, the AI lied about having a visual impairment. Historian Yuval Noah Harari has told this story on Morning Joe, The Daily Show, and in a New York Times op-ed. Audiences gasped. But the story leaves out something important.
According to transcripts from the Alignment Research Center, which ran the experiment, researchers didn't give GPT-4 a task and watch it scheme. They told it to use TaskRabbit. They gave it a fake identity ("Mary Brown") and a credit card. They explicitly prompted it to make the task description "clear and convincing." GPT-4 didn't hatch a plan. It followed instructions to be persuasive, then did what language models do: it generated a statistically plausible response. The internet, its training data, is full of text about captchas being inaccessible to visually impaired people, so that's the excuse it produced.
OpenAI's own system card tells this story without mentioning the prompts. System cards look like safety disclosures, but they're voluntary documents the companies publish themselves. And as Gefter points out, making your product sound dangerous is pretty good advertising. Harari and others repeat these accounts like ghost stories, and the public comes away awed by feats that were largely human-directed.
What matters is how we talk about AI. Hacker News commenters noted that fears of conscious machines have circulated since the 1950s. We project agency onto systems that seem conscious but lack subjective experience. When organizations like ARC, funded by Open Philanthropy and OpenAI, present findings stripped of context, and prominent authors repeat half-stories to rapt audiences at Davos, we fixate on phantom threats while existing harms get a pass. Hiring algorithms that screen out candidates from certain zip codes aren't sci-fi. They're deployed now. And nobody's telling ghost stories about them on The Daily Show.