SkyPilot ran an experiment that should change how we think about coding agents. They added a research phase to Claude Code, letting it read arXiv papers and study competing projects before writing any code.

Pointed at llama.cpp's CPU inference path, the agent found five kernel optimizations that pure code analysis completely missed. The result: 15% faster text generation on x86 and 5% faster on ARM for TinyLlama 1.1B. Total cost came to about $29 in cloud spend over 3 hours using 4 VMs.

Here's where code-only agents hit a wall. The agent's first wave chased SIMD micro-optimizations that barely moved the needle: gains of 0.6% to 0.9%, all within measurement noise.

Text generation is memory-bandwidth bound, not compute bound. At batch size 1, every generated token streams the full weight set from RAM, so a 606 MiB model running at 49 tokens/s consumes roughly 30 GB/s of memory bandwidth, nearing the hardware's limit. You can't SIMD your way out of a data movement problem, and the code itself won't tell you that. You need to understand the roofline model and recognize that batch-size-1 inference hits memory limits first.

The research phase fixed this. Between experiment waves, the agent studied ik_llama.cpp, llamafile's tinyBLAS, PowerInfer, and ExLlamaV2. It also searched arXiv for FlashAttention and operator fusion papers.

Competing codebases proved more useful than academic literature. Two of the five final optimizations came directly from patterns spotted in ik_llama.cpp and the CUDA backend. The biggest win fused three passes over flash attention's QK tile into a single AVX2 FMA loop.

This builds on Andrej Karpathy's autoresearch concept, which SkyPilot generalized into pi-autoresearch. Shopify CEO Tobi Lütke already used it to cut Liquid template parse+render time by 53%.

But those optimization opportunities lived in plain sight within the source. When the answer instead sits in domain knowledge, in the experience a senior engineer would bring, agents need to do homework first. And SkyPilot's data suggests that for production optimization work, the state of the art lives in GitHub repos, not academic papers.