Yogesh Kumar published a detailed account this week of using Claude Code as a fully autonomous machine learning research agent, running 42 experiments over a single Saturday and cutting the mean rank of a CLIP-based image retrieval model by 54 percent. The experiment applied <a href="/news/2026-03-15-karpathy-autoresearch-ai-ml-experiments">Andrej Karpathy's "Autoresearch" framework</a> — a tightly constrained loop in which an LLM agent modifies a single train.py file, trains a model, evaluates against a fixed metric, then commits or reverts via git before repeating — to Kumar's old eCLIP codebase, originally a CLIP model trained on medical X-ray data. Unable to access the original medical datasets, Kumar substituted the Ukiyo-eVG dataset, roughly 11,000 Japanese woodblock prints with phrase-to-bounding-box annotations, converting the bounding boxes into Gaussian heatmaps that serve as spatial attention signals. The model paired a ViT-Small vision encoder with a DistilBERT text encoder, totaling around 90 million parameters, and trained for 800 steps per run — about three minutes on an RTX 4090.
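The commit-or-revert loop is simple enough to sketch. The following is an illustrative Python reduction of the idea — not Kumar's or Karpathy's actual code — with file edits and git replaced by an in-memory config snapshot, and `evaluate` standing in for the 800-step train-and-score run:

```python
def autoresearch_loop(config, proposals, evaluate):
    """Karpathy-style commit-or-revert loop, reduced to its control flow.

    Each proposal plays the role of an agent edit to train.py; evaluate()
    stands in for the short train-and-score run. A change is kept
    ('committed') only if it lowers mean rank; otherwise state is
    restored ('reverted') from the snapshot."""
    best = evaluate(config)
    history = []
    for change in proposals:
        snapshot = dict(config)        # baseline, like the last git commit
        config.update(change)          # apply the agent's edit
        score = evaluate(config)
        if score < best:               # lower mean rank is better
            best = score               # keep the change: git commit
            history.append(("commit", change, score))
        else:
            config.clear()
            config.update(snapshot)    # discard the change: git revert
            history.append(("revert", change, score))
    return config, best, history
```

In the real setup the snapshot is a git commit and each proposal is a source edit, but the control flow — apply, measure, keep or roll back — is the same.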

The most significant result was not an architectural innovation but a bug fix. Claude Code identified that the learnable temperature parameter had been clamped too tightly at 2.0; relaxing that limit produced a drop of 113 mean rank points, more than all architectural changes combined. A subsequent Optuna-based hyperparameter sweep contributed another 30 points. Kumar containerized the training environment and restricted Claude Code's permissions to editing only two files and executing a run.sh script, explicitly blocking direct Python execution, pip installs, network access, and git pushes. Of 42 experiments run, 13 were committed and 29 reverted, reducing mean rank from 344.68 to 157.43. A final full-dataset training run produced test scores better than validation scores, indicating the short 800-step experiment runs had been leaving performance on the table.
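Why a tight temperature clamp hurts is easy to illustrate. In CLIP-style contrastive training, cosine similarities are multiplied by a learnable scale (the "temperature") before the softmax; capping that scale too low prevents the model from sharply separating the true image-text pair from near-misses. The sketch below is hypothetical — `scaled_softmax` is an illustrative helper, not code from Kumar's repo — and assumes the 2.0 clamp applied to the multiplicative scale:

```python
import math

def scaled_softmax(sims, logit_scale, scale_cap):
    """Softmax over cosine similarities multiplied by a clamped learnable
    scale, as in CLIP-style contrastive losses. The clamp bounds how
    sharply the distribution can peak on the true pair."""
    t = min(logit_scale, scale_cap)            # the clamp in question
    logits = [t * s for s in sims]
    m = max(logits)                            # stabilize the exponentials
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

sims = [1.0, 0.9, 0.5]  # true match vs. two close distractors
tight = scaled_softmax(sims, logit_scale=100.0, scale_cap=2.0)
loose = scaled_softmax(sims, logit_scale=100.0, scale_cap=100.0)
# Under the tight 2.0 cap the true match gets under half the probability
# mass; with a relaxed cap the softmax concentrates nearly all of it there.
```

Since the contrastive loss pushes probability mass toward the true pair, a scale stuck at a low cap leaves a persistent loss floor — exactly the kind of implementation detail a brute-force commit-or-revert sweep can surface.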

Performance degraded noticeably in later phases. When Kumar opened the search space to open-ended architectural exploration — including giving Claude Code web access to read papers — the agent's hypothesis success rate dropped sharply. Kumar notes the agent "was just throwing spaghetti at the wall," and at one point it stopped looping entirely after hitting the permission boundaries it had been given. That failure pattern matches the assumptions built into Karpathy's framework: the commit-or-revert loop is optimized for bounded search spaces with short, cheap experiments and a clear monotonic evaluation signal. Extending it into unknown-unknown territory without a planning stage or multi-agent decomposition breaks those assumptions.

Kumar's experiment operates at a far smaller scale than heavyweight autonomous research frameworks like Google CAIR's MARS — which uses budget-aware Monte Carlo Tree Search and reflective memory — and ML-Master 2.0, which reports a 56.44 percent medal rate on OpenAI's MLE-Bench. Its significance is different: a single practitioner, one consumer GPU, and an off-the-shelf agentic coding tool produced a meaningful, reproducible improvement on a real research problem in one day. Karpathy's Autoresearch paradigm may be most valuable not as a creativity engine but as a disciplined auditor — a loop that exhaustively covers a known hypothesis space and catches implementation errors that human researchers tend to rationalize past.