The early weeks of 2026 produced two heavily scrutinized demonstrations of LLM-powered autonomous software translation: Cursor's January 14th post on scaling long-running autonomous coding, which included translations of a browser, a Java LSP, a Windows emulator, and Excel, and Anthropic's follow-up on building a C compiler using a team of parallel Claude instances. Both generated significant attention and sharp criticism. The Cursor browser translation drew well-deserved technical skepticism, and Anthropic's C compiler, despite compiling the Linux kernel, famously failed on a basic "Hello World" example. Alperen Keles, a University of Maryland PhD student and Datadog engineer, argues in a February 11 analysis that these failures say nothing about what the models can do. His position, informed by his own production experience building BitsEvolve at Datadog, is that the bottleneck is harness maturity (the human-designed evaluation infrastructure surrounding the models), not the models themselves.

Keles's core claim is simple: LLMs do not translate software. They propose candidate translations, which a concrete evaluator then accepts or rejects. He uses the infinite monkey theorem as a deliberately absurd illustration: LLMs simply provide a far better sampling distribution over possible translations than random generation would. The economics follow directly: translation cost equals cost per iteration multiplied by expected iterations until success, plus the cost of harness engineering and oversight. Better harnesses reduce the iteration count; cheaper models reduce the per-iteration cost. By this framing, the high-profile demos failed not because the models gave up, but because the harnesses were not rigorous enough to guide them to a correct result within a viable budget.
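That propose-and-verify loop, with its cost accounting, can be sketched in a few lines. This is an illustrative reconstruction of the framing, not code from the article; `propose_translation` and `evaluator` are hypothetical placeholders standing in for an LLM call and a concrete harness check:

```python
def translate(source_code, propose_translation, evaluator,
              cost_per_iteration, budget):
    """Propose-and-verify loop: the LLM only proposes candidates;
    a concrete evaluator accepts or rejects each one.

    Total spend mirrors the article's framing:
    cost = cost_per_iteration * iterations until success
    (harness engineering and oversight are accounted for separately).
    """
    spent = 0.0
    while spent + cost_per_iteration <= budget:
        spent += cost_per_iteration
        candidate = propose_translation(source_code)
        if evaluator(source_code, candidate):   # concrete accept/reject
            return candidate, spent             # success within budget
    return None, spent  # budget exhausted before a candidate passed
```

In this framing, a better harness is an `evaluator` strict enough that only genuinely correct candidates pass, and a cheaper model lowers `cost_per_iteration`; either change shifts where the loop terminates relative to the budget.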

The verification technique Keles identifies as most critical is differential testing, which checks behavioral equivalence between source and target code across large, randomized input spaces, alongside property-based testing more broadly. Both have been used by compiler, database, and browser researchers for decades; they are now entering mainstream engineering through their role in AI translation pipelines. Keles is careful about their limits: differential testing operationalizes functional equivalence but says nothing about performance, security, or any property that cannot be reduced to observable input-output behavior. That gap defines the next practical problem: getting LLM-driven pipelines to produce target code that is not merely semantically equivalent but also faster or easier to maintain. Datadog's BitsEvolve, which Keles helped build, already does this. Its LLM-backed evolutionary agent, combining formal verification, shadow evaluation against live production traffic, and WebAssembly hot-swapping, has autonomously achieved 270 to 541 percent improvements on targeted time-series aggregation workloads.
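A minimal sketch of differential testing makes the limits concrete: the two implementations below are deliberately trivial stand-ins (not from the article) for a source program and its translation, and the harness only observes input-output behavior, exactly the equivalence notion Keles says the technique captures:

```python
import random

def reference_sum(xs):          # stand-in for the source implementation
    total = 0
    for x in xs:
        total += x
    return total

def translated_sum(xs):         # stand-in for the LLM-produced translation
    return sum(xs)

def differential_test(impl_a, impl_b, gen_input, trials=1000, seed=42):
    """Run both implementations on randomized inputs.

    Returns the first diverging input (a counterexample), or None if
    every trial agreed. Agreement is evidence, not proof, of equivalence,
    and says nothing about performance or security.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        inp = gen_input(rng)
        if impl_a(inp) != impl_b(inp):
            return inp          # behaviors diverge on this input
    return None                 # no divergence observed

def random_int_list(rng):
    """Randomized input generator: lists of varying length and sign."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]

counterexample = differential_test(reference_sum, translated_sum,
                                   random_int_list)
```

In a real translation harness, `impl_a` and `impl_b` would be the original and translated programs run behind a common interface, and the input generator would be tuned to cover the source program's domain, which is where much of the harness-engineering effort goes.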

The ecosystem Keles surveys has settled on the same architectural pattern, an LLM proposal engine coupled to a domain-specific evaluator in a closed feedback loop, across a range of research groups and production teams. ShinkaEvolve from Sakana AI, accepted at ICLR 2026, applies it to algorithm discovery. AlgoTune, a NeurIPS 2025 benchmark, measured it against numerical programming tasks and found a 1.72x average speedup. UC Berkeley's ADRS framework demonstrated 13x speedups in load balancing and 35 percent cost savings in cloud scheduling. MIT CSAIL's Glia, led by Professor Mohammad Alizadeh, applied the pattern to GPU inference cluster scheduling, matching two weeks of human engineering effort in roughly two hours. Martin Fowler has published dedicated writing on <a href="/news/2026-03-14-8-levels-agentic-engineering-framework">harness engineering</a> as a discipline, and in February 2026, OpenAI released a harness engineering paper describing a million-line codebase managed through approximately 1,500 automated pull requests with no manually written code. By early 2026, harness quality, not raw model capability, is where the competitive leverage lies. BitsEvolve's production numbers are the clearest evidence of what that looks like when it works.