Andrej Karpathy dropped autoresearch last week — a project that hands your LLM training script to an AI coding agent and tells it to find improvements overnight. The setup is stripped to essentials: one file (train.py, pulled from his nanochat project), one metric (val_bpb, validation bits per byte), and a fixed five-minute wall-clock budget per experiment. The agent edits the code, runs the job, checks the number, keeps it if it's better, reverts via git if it isn't, and repeats. Twelve cycles an hour, around a hundred by morning.
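The loop itself is simple enough to sketch. Here's a minimal, hypothetical Python version with stand-ins for the agent's edit and the five-minute training run (`propose_edit` and `val_bpb` are invented for illustration; the real harness edits train.py and reverts via git):

```python
import random

def propose_edit(config):
    # Stand-in for the coding agent's change: tweak one hyperparameter.
    new = dict(config)
    new["lr"] = config["lr"] * random.choice([0.5, 1.0, 2.0])
    return new

def val_bpb(config):
    # Stand-in for a five-minute training run; lower is better.
    # (Toy objective: pretend the best learning rate is 3e-4.)
    return abs(config["lr"] - 3e-4) * 1000 + 1.3

def overnight(cycles=100):
    best = {"lr": 1e-3}
    best_score = val_bpb(best)
    for _ in range(cycles):
        candidate = propose_edit(best)      # agent edits the script
        score = val_bpb(candidate)          # run the job, read the metric
        if score < best_score:              # better? keep the change
            best, best_score = candidate, score
        # worse? discard candidate — the equivalent of `git checkout`
    return best, best_score
```

Greedy hill-climbing, nothing more: the only state carried between cycles is the current best script and its score.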

The first round of outputs didn't stay confined to a research repo. In March 2026, autoresearch results were merged into nanochat's Time-to-GPT-2 leaderboard, cutting the speedrun record from 2.02 hours to 1.80 hours on an 8xH100 node. The agent had found improvements across architecture, hyperparameters, and optimizer configuration that human researchers hadn't caught. The researcher's only lever in the process is program.md, a plain Markdown file describing what the agent should explore; it effectively replaces the researcher in the experimental iteration loop with a document.
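What goes in that file is up to you. A hypothetical example (invented for illustration, not taken from the repo), covering the areas the agent improved:

```markdown
# Program: reduce val_bpb on train.py

Budget: 5 minutes wall-clock per run. Metric: val_bpb (lower is better).

Directions to explore:
- Architecture: layer widths, normalization placement, attention variants
- Hyperparameters: learning rate, warmup, batch size
- Optimizer: schedule shape, weight decay, momentum settings

Rules: one change per experiment; keep the winner, revert everything else.
```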

The community didn't wait around. Within weeks of release, @miolini published autoresearch-macos, swapping out the hardcoded FlashAttention-3 dependency for PyTorch's native SDPA to get it running on Apple Silicon. @trevin-creator went further with autoresearch-mlx, dropping PyTorch and CUDA entirely and rebuilding the pipeline on Apple's MLX framework. Overnight runs on M4 Max and Mac Mini hardware hit val_bpb values as low as 1.294 — and surfaced something worth noting: hyperparameter recipes don't port cleanly across hardware. Given the same time budget, Apple Silicon favors smaller, faster-training models over the larger configurations that win on CUDA clusters.
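The swap is possible because FlashAttention-3 and PyTorch's SDPA are different kernels for the same mathematical function, scaled dot-product attention: softmax(QKᵀ/√d)·V. A dependency-free Python sketch of that function, as a reference for what both backends compute (not either library's actual code):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sdpa(Q, K, V):
    # Scaled dot-product attention on plain nested lists:
    # for each query, weight the value rows by softmax(q·k / sqrt(d)).
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

Because the function is identical, a backend swap changes speed and hardware support, not model behavior — which is why the hyperparameter drift across hardware is the interesting finding, not a numerical artifact.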

Karpathy described the repo as "the story of how it all began" — a prelude, he implied, to autonomous AI swarms running frontier research across compute megastructures. That's a lot of mythology to attach to a Python training script. But the underlying point is already demonstrated: an agentic loop can compress what used to be days of manual experimentation into a single overnight run. The loop is closed.