When Andrej Karpathy published autoresearch in March 2026 — an agent that autonomously iterates on LLM training code to improve loss — the obvious next question was whether the same idea could work outside of machine learning. Pi-Autoresearch, an open-source project by GitHub user davebcn87, is an answer: a generalized version of that loop, built as an extension for the 'pi' agent IDE, that can target any metric a shell command can return.
The mechanics are simple. The agent picks a change, applies it as a git commit, runs a benchmark, and decides whether to keep or revert that commit. Then it does it again. The target is whatever you configure: test suite execution time, JavaScript bundle size, build speed, Google Lighthouse scores. The loop runs until you stop it.
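The core of one iteration can be sketched in a few lines. This is a minimal illustration of the keep-or-revert pattern, not the project's actual code; the `apply_change`, `benchmark`, and `revert` callables are stand-ins for "commit a candidate edit", "run the configured shell command", and "git revert", and the assumption that lower is better (loss, bundle size, runtime) is mine:

```python
def optimize_step(apply_change, benchmark, revert, best_metric):
    """One iteration of a keep/revert optimization loop (illustrative sketch).

    apply_change: commits one candidate modification to the codebase.
    benchmark:    runs the configured command and returns a numeric metric.
    revert:       rolls the candidate commit back.
    """
    apply_change()
    metric = benchmark()
    if metric < best_metric:      # assumes lower is better, e.g. loss or bundle size
        return metric, True       # keep the commit
    revert()                      # otherwise roll it back
    return best_metric, False
```

A driver would simply call `optimize_step` in a loop, feeding the returned metric back in as the new baseline, until the user stops it.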
The architecture separates two distinct jobs. A global extension provides three tools — `init_experiment`, `run_experiment`, and `log_experiment` — along with a live status widget and a `/autoresearch` dashboard inside the pi IDE. A per-domain skill called `autoresearch-create` handles setup: it interviews the user about their optimization goal, infers the right benchmark command, and generates two session files the agent uses as working memory.
Those two files are where the project gets interesting. Every run appends a JSON line to `autoresearch.jsonl` — metric value, keep/revert decision, git commit hash, description of the attempted change. A companion `autoresearch.md` tracks the session objective, strategies tried, dead ends, and wins. A fresh agent instance with no conversation history can resume the session exactly where the last one stopped.
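An append-only JSONL log like this is easy to reproduce. The sketch below writes one record per run with the four fields the project tracks; the exact key names are my assumption, since the article only lists the fields, not the schema:

```python
import json

def log_run(path, metric, decision, commit, description):
    """Append one experiment record as a single JSON line (field names assumed)."""
    record = {
        "metric": metric,            # benchmark value for this run
        "decision": decision,        # "keep" or "revert"
        "commit": commit,            # git commit hash of the attempted change
        "description": description,  # what the agent tried
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSON Lines is a good fit here: each run is one atomic write, a crashed process can at worst lose its final line, and the file stays trivially parseable.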
That design matters because long-running autonomous loops fail constantly. Context windows fill, processes crash, the user closes the terminal. Pi-Autoresearch doesn't try to prevent interruptions — it just makes them not matter. The state lives in files, not in the agent's memory.
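Resuming from that file-based state is mechanical. A hypothetical sketch, assuming the JSONL schema above: a fresh agent replays the log to recover the best kept result, with no conversation history needed:

```python
import json

def resume_state(jsonl_path):
    """Rebuild session state from the run log alone (illustrative sketch).

    Returns the most recent kept run, i.e. the commit and metric the next
    iteration should treat as its baseline.
    """
    best = None
    with open(jsonl_path) as f:
        for line in f:
            run = json.loads(line)
            if run["decision"] == "keep":
                best = run
    return best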
The extension-plus-skill split is also worth noting: reusable infrastructure lives in the extension, domain knowledge lives in the skill, so the same installation can serve multiple optimization targets without modification. Whether the tool finds a broad audience depends on how it handles production codebases — where search spaces are messier and the cost of a bad automated commit is higher — but the architecture is portable enough that the answer doesn't have to be the same for every project.