Imbue built a system that spins up more than 100 Claude agents at once to test their own tool, mngr. Here's how it works: start with a tutorial script containing command blocks, have coding agents convert each block into pytest functions, then launch individual agents to run, debug, and improve each test. A map-reduce pattern at the end integrates all results back together. The whole thing uses mngr's own primitives (create, list, pull, stop) to manage the fleet.
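The fan-out/fan-in shape of that pipeline can be sketched in a few lines. This is a minimal illustration, not Imbue's code: `run_agent` is a hypothetical stand-in for one agent's run-debug-report loop (in the real system each agent would be managed via `mngr create` / `mngr pull` / `mngr stop`), and the reduce step just merges per-agent reports into one summary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent(test_name: str) -> dict:
    # Hypothetical stand-in for a single agent: run one test, debug it,
    # and report a result. Here we simulate the outcome.
    passed = not test_name.endswith("_flaky")
    return {"test": test_name, "passed": passed}

def fan_out_and_reduce(test_names: list[str], max_agents: int = 100) -> dict:
    """Map: launch one agent per test in parallel. Reduce: merge reports."""
    results = []
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        futures = [pool.submit(run_agent, name) for name in test_names]
        for fut in as_completed(futures):
            results.append(fut.result())
    # Reduce step: integrate all per-agent results into one summary.
    return {
        "total": len(results),
        "passed": sum(r["passed"] for r in results),
        "failed": sorted(r["test"] for r in results if not r["passed"]),
    }

summary = fan_out_and_reduce(["test_create", "test_list", "test_pull_flaky"])
print(summary)  # {'total': 3, 'passed': 2, 'failed': ['test_pull_flaky']}
```

The key property is that each agent's unit of work is independent, so the fleet scales horizontally and the reduce step is the only point of coordination.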

Writing end-to-end tests is hard, even for AI. Imbue notes the same tensions that plague human testers: the Arrange stage needs minimal setup but enough isolation, the Act stage wants to run commands faithfully but often needs variations to exercise edge cases, and the Assert stage wants meaningful checks without brittle assertions. Coding agents struggle with these tensions too, and that's part of the design: each agent works on one test, debugs it, fixes it, and moves on. When agents fail to generate good examples from the documentation, Imbue treats that as a signal that the tool's interface is confusing, not just as an agent failure mode.
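Here is what one converted command block might look like as a pytest function, with the three stages marked. This is a hypothetical sketch (the shell command is a placeholder; Imbue's actual tests would invoke `mngr` commands), showing the shape: isolated setup, a faithful command run, and assertions that check outcomes rather than exact output.

```python
import subprocess
import tempfile
from pathlib import Path

def test_tutorial_block_writes_file():
    # Arrange: minimal, isolated setup -- a fresh temp dir, no shared state.
    workdir = Path(tempfile.mkdtemp())
    target = workdir / "hello.txt"

    # Act: run the tutorial's command block faithfully (placeholder command
    # here; the real tests would run `mngr create ...` and friends).
    result = subprocess.run(
        ["sh", "-c", f"echo 'hello' > {target}"],
        capture_output=True, text=True,
    )

    # Assert: meaningful checks without brittle exact-output matching.
    assert result.returncode == 0
    assert target.exists()
    assert "hello" in target.read_text()
```

Asserting on the return code and the file's contents, rather than on the command's full stdout, keeps the test resilient to cosmetic output changes.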

The economics work out because you're compressing days of human QA work into hours. Running 100+ parallel agents costs somewhere in the $50-200 per hour range depending on token usage, and the break-even point against human testers ($50-150 per hour) arrives after roughly 20-40 hours of equivalent testing work. The real win: you can spin up 100 testers for one critical release, then scale back down to zero, without ever conducting a single interview.