RightNow AI has open-sourced AutoKernel, an autonomous AI agent system that puts LLMs to work optimizing GPU kernels for PyTorch models — running up to 320 experiments overnight while engineers sleep. Directly inspired by Andrej Karpathy's autoresearch project, which applied the same autonomous loop to LLM training experimentation, AutoKernel extends that philosophy into the notoriously manual and time-intensive domain of GPU kernel engineering. The system accepts any PyTorch model, profiles it with torch.profiler to identify bottleneck kernels, and extracts those kernels into standalone Triton or CUDA C++ files; it then hands the files to an LLM coding agent (Claude, Codex, or any compatible agent), which runs a continuous edit-benchmark-keep/revert loop under full autonomous operating instructions.
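The profiling step described above can be sketched with torch.profiler's standard API — the function name and ranking choices here are illustrative assumptions, not AutoKernel's actual code:

```python
# Hypothetical sketch of the profiling stage: rank a model's ops by self
# time using torch.profiler, so the heaviest kernels can be extracted
# first. Falls back to CPU timing when no GPU is present.
import torch
from torch.profiler import profile, ProfilerActivity

def find_bottleneck_ops(model, example_input, top_k=5):
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    model.eval()
    with torch.no_grad(), profile(activities=activities) as prof:
        model(example_input)
    # Aggregate events by op name, then sort by self time (microseconds).
    key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
    events = sorted(prof.key_averages(), key=lambda e: getattr(e, key), reverse=True)
    return [(e.key, getattr(e, key)) for e in events[:top_k]]

model = torch.nn.Sequential(
    torch.nn.Linear(256, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 256)
)
print(find_bottleneck_ops(model, torch.randn(8, 256)))
```

On a transformer, a listing like this typically surfaces matmul and attention ops at the top, which is exactly the set AutoKernel targets for extraction.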
The architecture is deliberately minimal and agent-centric. A handful of fixed scripts (profile.py, extract.py, bench.py, orchestrate.py, verify.py) handle the scaffolding, while a comprehensive program.md file functions as the agent's operating manual — containing a 6-tier optimization playbook, crash handling procedures, and Amdahl's law reasoning for prioritizing which of the nine supported kernel types to tackle next. Those kernel types cover the core operations of modern transformer architectures: matmul, flash attention, fused MLP, RoPE embeddings, softmax, layernorm, rmsnorm, cross entropy, and parallel reduce. Before any optimization is accepted, bench.py enforces a rigorous five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism, and edge cases — plus roofline analysis to ensure changes represent genuine hardware-bound improvements.
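The kind of gate bench.py enforces can be sketched as a reference comparison across a shape sweep — the function, tolerances, and test cases below are assumptions for illustration, not AutoKernel's harness:

```python
# Hypothetical sketch of a correctness gate: run a candidate kernel
# against a trusted reference over several shapes, checking numerics,
# determinism, and an empty-input edge case before accepting it.
import torch

def correctness_gate(candidate, reference, shapes, rtol=1e-3, atol=1e-3):
    for shape in shapes:                      # shape sweep
        x = torch.randn(*shape)
        out = candidate(x)
        if not torch.allclose(out, reference(x), rtol=rtol, atol=atol):
            return False                      # numerical mismatch
        if not torch.equal(candidate(x), out):
            return False                      # non-deterministic result
    # Edge case: an empty batch must not crash the kernel.
    candidate(torch.randn(0, shapes[0][-1]))
    return True

# A hand-rolled numerically stable softmax versus the eager reference:
ref_softmax = lambda x: torch.softmax(x, dim=-1)
def cand_softmax(x):
    e = torch.exp(x - x.max(dim=-1, keepdim=True).values)
    return e / e.sum(dim=-1, keepdim=True)

print(correctness_gate(cand_softmax, ref_softmax, [(4, 8), (16, 128), (1, 1024)]))
```

A production harness would add timing under the same gate, so a kernel that is fast but wrong (or wrong only at odd shapes) can never be kept.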
AutoKernel integrates with KernelBench (arXiv:2502.10517, accepted at ICML 2025), the Stanford Scaling Intelligence Lab benchmark that evaluates LLMs on 250 PyTorch ML workloads across four difficulty levels. KernelBench research reveals that frontier reasoning models beat the PyTorch baseline in fewer than 20% of cases with single-shot generation — a sobering finding that underscores exactly why AutoKernel's iterative refinement loop exists. Where one-shot approaches guess, AutoKernel systematically explores: 50 to 300-plus experiments per problem, each taking roughly 90 seconds, with the orchestrator continuously re-prioritizing based on remaining Amdahl's law speedup potential across the kernel set.
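The Amdahl's-law re-prioritization can be sketched in a few lines — the runtime fractions, achieved speedups, and the 4× ceiling below are illustrative assumptions, not measured numbers:

```python
# Hypothetical sketch of Amdahl's-law prioritization: given each kernel's
# share of total runtime and the speedup achieved so far, rank kernels by
# the whole-model speedup still available from optimizing them further.

def amdahl_speedup(fraction, kernel_speedup):
    """Whole-model speedup when `fraction` of runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def prioritize(kernels, ceiling=4.0):
    """Rank (name, runtime_fraction, achieved_speedup) tuples by remaining headroom."""
    scored = []
    for name, frac, achieved in kernels:
        potential = amdahl_speedup(frac, ceiling)   # if pushed to the ceiling
        current = amdahl_speedup(frac, achieved)    # what we already have
        scored.append((name, potential / current))
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Attention dominates runtime and is barely optimized, so it ranks first.
kernels = [("flash_attention", 0.45, 1.1), ("matmul", 0.30, 2.0), ("layernorm", 0.05, 1.5)]
for name, headroom in prioritize(kernels):
    print(name, round(headroom, 3))
```

The point the orchestrator exploits is visible in the numbers: a kernel that is only 5% of runtime caps out near a 1.04× whole-model gain no matter how fast it gets, so experiment budget flows to the dominant kernels.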
The open-source release is a research-facing complement to RightNow AI's commercial Forge product, which targets enterprise ML teams deploying on NVIDIA datacenter GPUs. RightNow AI claims Forge delivers 3× inference speedup, 67% power reduction, and GPU utilization gains from roughly 16% to 88%, though these figures come from the company's own marketing materials and have not been independently verified. AutoKernel ships with self-contained GPT-2, LLaMA, and BERT model definitions that require no Hugging Face transformers dependency, with KernelBench's fast_p metric providing a shared scoreboard for anyone who wants to benchmark their own results.
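KernelBench's fast_p metric, as described in the paper, is the fraction of problems where a generated kernel is both correct and faster than the PyTorch baseline by more than a factor p; a minimal sketch, with an assumed record format:

```python
# Minimal sketch of the fast_p metric: count a problem as a win only when
# the generated kernel is correct AND its speedup over baseline exceeds p.
# The (is_correct, baseline_time, kernel_time) tuple format is assumed.

def fast_p(results, p=1.0):
    """results: list of (is_correct, baseline_time, kernel_time) tuples."""
    wins = sum(
        1 for correct, base_t, kern_t in results
        if correct and base_t / kern_t > p
    )
    return wins / len(results)

# Three problems: correct and 2x faster, correct but slower, fast but wrong.
results = [(True, 10.0, 5.0), (True, 10.0, 12.0), (False, 10.0, 2.0)]
print(fast_p(results, p=1.0))  # only the first problem counts
```

Raising p tightens the bar: fast_p at p=1.0 asks "correct and any speedup", while p=2.0 asks "correct and at least twice as fast", which is how the benchmark separates marginal wins from substantial ones.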