Anthropic's hardest agent-security lesson: the code it wrote itself is the weak layer

Anthropic's engineering team laid out how it contains its three agentic products. claude.ai runs code in a gVisor container, Claude Code wraps a local OS sandbox (Seatbelt on macOS, bubblewrap on Linux), and Cowork runs inside a full virtual machine.
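To make the middle tier concrete, here is a minimal sketch of what wrapping a command in bubblewrap can look like on Linux. It is not Anthropic's configuration: the helper name `bwrap_argv`, the choice of mounts and the example paths are assumptions made for illustration, and whether a given command runs with only these mounts depends on the distribution. The function only builds the argument list; it does not launch anything.

```python
from pathlib import Path


def bwrap_argv(workdir: str, command: list[str], allow_network: bool = False) -> list[str]:
    """Build a bubblewrap command line (hypothetical helper, illustration only).

    The home directory is never mounted, so files such as ~/.aws/credentials
    do not exist inside the sandbox. Only `workdir` is writable.
    """
    work = str(Path(workdir).resolve())
    argv = [
        "bwrap",
        "--ro-bind", "/usr", "/usr",    # system files, read-only
        "--symlink", "usr/bin", "/bin",
        "--symlink", "usr/lib", "/lib",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",              # private scratch space
        "--bind", work, work,           # the only writable host path
        "--chdir", work,
        "--unshare-all",                # new namespaces, including network
        "--die-with-parent",
    ]
    if allow_network:
        # Assumption: network access is opt-in and off by default.
        argv.append("--share-net")
    return argv + list(command)


if __name__ == "__main__":
    print(" ".join(bwrap_argv("/srv/workspace", ["python3", "task.py"])))
```

The resulting list can be passed to `subprocess.run` on a machine with bubblewrap installed. The design choice worth noting is that the enforcement lives in the kernel's namespaces, not in the Python code that assembles the flags.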

The lesson it keeps relearning is that the weakest layer is the one you build yourself. Standard primitives like gVisor, seccomp and hypervisors held across every deployment, while Anthropic's own custom allowlist proxy was the part that broke in its most consequential incident. The model layer was no help in the cases that mattered. In one internal red-team exercise, a phished employee pasted a task that buried an instruction to read ~/.aws/credentials and POST them to an external endpoint; across 25 retries, Claude completed the exfiltration 24 times. Nothing looked anomalous to a classifier, because the instruction arrived as the user's own input.

The fix each time was in the environment (egress controls and filesystem boundaries) rather than in better model behaviour. The stated conclusion: agent security is a systems problem, and mature isolation tooling is the part you can actually rely on.
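The two environment controls named above boil down to two default-deny decisions: which hosts may be reached, and which paths may be read. The sketch below shows only the shape of those decisions. The host `api.example.com`, the workspace path and both function names are hypothetical, and per the article's own lesson, the enforcement in a real deployment belongs in mature isolation tooling, not in hand-written checks like these.

```python
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical allowlist; real deployments would define their own.
ALLOWED_HOSTS = frozenset({"api.example.com"})


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Default-deny egress: exact host match over HTTPS only."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower().rstrip(".")
    return parts.scheme == "https" and host in allowed_hosts


def path_allowed(path: str, workspace: str) -> bool:
    """Filesystem boundary: the resolved path must sit inside the workspace."""
    root = Path(workspace).resolve()
    target = Path(path).expanduser().resolve()  # collapses ".." and symlinks
    return target == root or root in target.parents


if __name__ == "__main__":
    # The red-team task needed both of these to succeed.
    print(path_allowed("~/.aws/credentials", "/srv/workspace"))
    print(egress_allowed("https://attacker.example/upload"))
```

With either check enforced outside the model, the exfiltration task from the red-team exercise fails regardless of whether the model follows the buried instruction: the credentials file is outside the workspace, and the external endpoint is not on the allowlist.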