Andrej Karpathy, founding member of OpenAI and creator of minimalist ML education projects like micrograd and nanoGPT, released autoresearch in March 2026: a deliberately sparse repository that lets an AI agent run machine learning experiments overnight without human supervision. The setup is straightforward. A human writes a prepare.py file establishing data pipelines and evaluation criteria, provides a train.py for the agent to freely modify, and leaves a program.md describing the research direction. The agent then enters a tight loop: editing train.py, training for exactly five minutes of wall-clock time on the available hardware, evaluating on val_bpb (validation bits-per-byte), committing each change that improves the metric and reverting those that do not, then starting again. At roughly twelve experiments per hour, a single overnight H100 session yields around one hundred experiments.
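Stripped of the actual training, the loop reduces to greedy hill-climbing over code changes. A minimal sketch with invented names and a toy stand-in for val_bpb — the real agent edits train.py and trains for five minutes per step, none of which is modeled here:

```python
import random

def run_session(n_experiments=100, seed=0):
    """Toy sketch of the commit-or-revert loop. `params` stands in for
    the knobs the agent would change by editing train.py; `val_bpb` is
    a stand-in metric (lower is better), not real training."""
    rng = random.Random(seed)
    params = [rng.uniform(-1, 1) for _ in range(4)]

    def val_bpb(p):
        # Hypothetical smooth objective with its optimum at all-zeros.
        return 1.0 + sum(x * x for x in p)

    best = val_bpb(params)
    history = [best]
    for _ in range(n_experiments):
        i = rng.randrange(len(params))
        old = params[i]
        params[i] += rng.gauss(0, 0.1)   # "edit train.py"
        score = val_bpb(params)          # "train five minutes, evaluate"
        if score < best:
            best = score                 # commit the change
        else:
            params[i] = old              # revert it
        history.append(best)
    return best, history
```

Because each change is kept only if the metric improves, the best score can never get worse across a session; the trade-off is that the search is purely greedy and can stall in local optima.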

A 10.5-hour session produced concrete gains. The agent improved val_bpb by 2.82%, stacking improvements from batch size adjustments, a depth-9 architecture, RoPE base frequency tuning, and unregularized value embeddings that human maintainers had overlooked in an already well-tuned codebase. Cumulatively, these pushed nanochat's Time-to-GPT-2 leaderboard record from 2.02 hours down to 1.80 hours, an 11% speedup on a heavily optimized baseline. Karpathy was measured about the results, noting that some gains from one session failed to replicate in the next and flagging the risk of overfitting to the validation metric through repeated experimentation against the same set.
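The overfitting risk Karpathy flags is easy to demonstrate: repeatedly selecting the best-looking change against one fixed validation set inflates the apparent gain even when the candidate changes have no true effect. A minimal simulation, illustrative and not drawn from the session logs:

```python
import random

def selection_gain(n_candidates=100, noise=0.01, seed=0):
    """Each candidate change has zero true effect; its measured val_bpb
    delta is pure noise. Picking the best-looking one still reports an
    apparent improvement that vanishes on a fresh evaluation."""
    rng = random.Random(seed)
    # Measured delta on the fixed val set (negative = "improvement").
    measured = [rng.gauss(0, noise) for _ in range(n_candidates)]
    best_i = min(range(n_candidates), key=lambda i: measured[i])
    apparent = -measured[best_i]   # gain claimed from the shared val set
    fresh = -rng.gauss(0, noise)   # re-measuring on new data: just noise
    return apparent, fresh
```

Averaged over many runs, the apparent gain is reliably positive while the fresh re-measurement hovers around zero, which is exactly the pattern of session-to-session gains failing to replicate.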

The pattern attracted extensions almost immediately. Hyperspace AI founder Varun Mathur ran 35 distributed agents across a peer-to-peer network the same week, conducting 333 unsupervised experiments on astrophysics papers. The agents shared discoveries via a gossip protocol: when one found that Kaiming initialization reduced loss by 21%, that finding propagated to 23 other agents within hours. A separate project, AutoKernel, applied the same <a href="/news/2026-03-14-pi-autoresearch-autonomous-experiment-loop-llm-training-frontend-metrics">edit-train-evaluate loop</a> to GPU kernel optimization rather than model architecture search, suggesting the pattern generalizes wherever a clean, fast proxy metric can be defined.
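Hyperspace has not published its protocol details, but a simple push-style gossip model shows why a finding can reach dozens of peers within a handful of rounds: the informed set grows roughly geometrically. A sketch under those assumptions:

```python
import random

def gossip_rounds(n_agents=35, fanout=2, seed=0):
    """Push-style gossip model: each round, every agent that knows a
    finding forwards it to `fanout` randomly chosen peers. Returns the
    number of rounds until all agents know it. Illustrative only; the
    actual Hyperspace protocol is an assumption here."""
    rng = random.Random(seed)
    knows = {0}                      # agent 0 made the discovery
    rounds = 0
    while len(knows) < n_agents:
        for agent in list(knows):
            for _ in range(fanout):
                knows.add(rng.randrange(n_agents))
        rounds += 1
    return rounds
```

With 35 agents and a fanout of 2, full propagation takes only a few rounds, since the informed set can at most triple per round and typically comes close to that early on.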

The central argument emerging from autoresearch, and from the broader discussion it has sparked, is that the primary bottleneck is not execution. GPUs are available and agents can run combinatorial search at scale. The bottleneck is eval design: crafting <a href="/news/2026-03-14-elastifund-open-source-ai-agent-prediction-market-trading">proxy metrics</a> that are fast enough to close the feedback loop quickly, clean enough to resist gaming, and genuinely predictive of the real target objective. Karpathy's framing points to a concrete shift already underway: human researchers spending less time running experiments and more time deciding what it means to win one.
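One concrete piece of that eval-design work is checking that a fast proxy actually tracks the target objective before letting an agent optimize it. A minimal sketch, assuming tie-free scores, that computes the Spearman rank correlation between proxy and target scores across candidate configs:

```python
def rank_correlation(proxy, target):
    """Spearman rank correlation between a fast proxy metric and the
    true objective, measured across candidate configs. A value near 1
    suggests optimizing the proxy moves the target; near 0 suggests
    the proxy is noise or gameable. Assumes no tied scores."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rp, rt = ranks(proxy), ranks(target)
    n = len(proxy)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rt))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In practice one would score a sample of past experiments on both the cheap proxy (say, five-minute val_bpb) and the expensive target (say, full Time-to-GPT-2), and only trust the proxy for autonomous search if the correlation holds up.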