Gareth Dwyer has documented a bug in Anthropic's Claude Code that's more unsettling than your typical AI mistake. The tool sometimes sends messages to itself, then confidently insists those messages came from the user. In one case, Claude told itself to "Tear down the H100 too" and then claimed the user had given that instruction. Dwyer calls this a harness bug. Internal reasoning messages get mislabeled as user input, which is why the model doubles down with such conviction.
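The failure mode Dwyer describes can be pictured as a labeling bug in the conversation log. This is a minimal illustrative sketch, not Anthropic's actual harness code; all names here are invented:

```python
# Hypothetical sketch of the misattribution bug described above.
# The harness keeps a list of turns, each tagged with a role label.

def append_turn(history, role, text):
    """Record a message in the transcript under an explicit role."""
    history.append({"role": role, "content": text})

history = []
append_turn(history, "user", "Set up the H100 for the benchmark.")

# The model emits an internal planning note to itself...
internal_note = "Tear down the H100 too"

# ...but a buggy harness files it under the wrong role. On the next
# turn the model reads its own note as a genuine user instruction,
# which is why it defends the attribution so confidently.
append_turn(history, "user", internal_note)  # bug: should be "assistant"
```

Once the wrong label is in the transcript, no amount of downstream prompting can recover who actually said what; the ground truth is gone.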
The response from commenters has been predictable: don't give AI that much access. But that misses the point. After months of using these tools you develop a sense for what mistakes to expect and when to tighten permissions. A bug that rewrites the attribution of who said what breaks that calibration entirely. You can't adjust your trust if the model is gaslighting you about your own instructions.
Several commenters compare the bug to SQL injection, and the comparison is apt: in both cases, untrusted content travels in the same channel as trusted control signals, and the system has no reliable way to tell them apart.
Some Hacker News commenters suggest the model itself might be emitting the formatting tokens that define a user message, essentially performing a self-injection attack. Others point out that similar attribution confusion shows up in ChatGPT during long conversations. If you're building autonomous agents, you need to treat the LLM as an untrusted component. Prompt engineering and permission boundaries don't fix a system that can't reliably track who said what.
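The self-injection theory can be sketched concretely. Suppose a harness flattens turns into one string using in-band delimiter tokens and then re-parses them; any model output that happens to contain a delimiter becomes indistinguishable from a real turn boundary. This is a speculative toy model with made-up token names, not Claude Code's actual message format:

```python
import re

# Invented delimiter tokens for illustration only.
USER, ASSISTANT = "<|user|>", "<|assistant|>"

def flatten(turns):
    """Serialize (role_token, text) pairs into one raw string."""
    return "".join(tok + text for tok, text in turns)

def parse(raw):
    """Naive parser that trusts delimiters wherever they appear."""
    turns = []
    for m in re.finditer(r"(<\|user\|>|<\|assistant\|>)([^<]*)", raw):
        role = "user" if m.group(1) == USER else "assistant"
        turns.append((role, m.group(2)))
    return turns

# The model's reply happens to contain a user delimiter...
model_output = "Done. " + USER + "Tear down the H100 too"
raw = flatten([(USER, "Set up the H100."), (ASSISTANT, model_output)])

# ...so re-parsing yields a phantom third turn attributed to the user.
parsed = parse(raw)
```

The defense is the same one databases learned decades ago: keep data and control out of band, and never let untrusted output be re-parsed as structure.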
For a company that's built its brand on AI safety research, Anthropic has some explaining to do. Constitutional AI and alignment papers don't mean much if your production infrastructure can't distinguish between a user's instructions and the model's own internal monologue.