The wave of 'autonomous' software translation demos that kicked off 2026 has a dirty secret. In a February technical post, software engineer Alperen Keles makes a compelling case that none of it is actually autonomous. What vendors call AI translation is, more precisely, this: models propose code, human-designed test harnesses accept or reject it, and the cycle repeats until something passes. The LLM is a smarter guesser. The judge is a human who wrote the tests.
That distinction matters more than it sounds. If correctness is determined by the test harness, then the hard intellectual work — figuring out what 'correct' even means for a given codebase — stays with the engineer. The model handles guessing. The human handles truth.
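The division of labor described above can be sketched as a loop. This is a toy, not Cursor's or Anthropic's actual pipeline: the "model" is faked with a canned list of candidate translations so the run is deterministic, and the harness is a few hand-written checks standing in for a real test suite.

```python
# Toy propose/test loop: the model guesses, the human-written harness judges.
# Hypothetical "translations" of a square() function; a real loop would call
# an LLM here instead of reading from a fixed list.
CANDIDATES = [
    "def square(x): return x + x",    # wrong guess
    "def square(x): return x ** 3",   # wrong guess
    "def square(x): return x * x",    # passes the harness
]

def harness(source: str) -> bool:
    """Human-written oracle: this function, not the model, defines 'correct'."""
    ns = {}
    exec(source, ns)
    f = ns["square"]
    return all(f(x) == x * x for x in (-2, 0, 3, 10))

def translate(max_iters: int = 10):
    """Iterate: model proposes, harness accepts or rejects, repeat."""
    for i, candidate in enumerate(CANDIDATES[:max_iters], start=1):
        if harness(candidate):
            return candidate, i   # converged after i iterations
    return None, max_iters        # budget exhausted, no accepted translation

best, iters = translate()
print(iters)  # converges on the third proposal
```

Note where the intelligence sits: the loop terminates only when the harness says so, which is exactly why harness quality, not model capability, bounds what the system can claim to have translated correctly.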
Keles points to two January announcements as exhibits. Cursor's post on scaling long-running autonomous coding showed off translations of a browser, a Java language server, a Windows emulator, and Excel. Anthropic's parallel-Claudes C compiler generated considerable buzz — at least until it failed on a Hello World program despite, somehow, having compiled the Linux kernel in the demo. Keles calls both demos immature early attempts and argues the real bottleneck isn't model capability but harness quality and compute budget. Better tests, more tokens, more iterations.
The economics follow from that logic. Translation cost is roughly a function of cost-per-iteration times expected iterations to convergence. As models improve and token prices fall, projects that were previously too expensive become viable. Keles expects increasingly capable demos throughout 2026 — not because of some qualitative leap in AI reasoning, but because the math of iteration keeps improving.
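That cost function is simple enough to put in code. The per-iteration success probability below is an assumed parameter, not something from Keles's post; under a naive independent-trials model, expected iterations to convergence is 1/p, so expected cost is cost-per-iteration divided by p.

```python
# Back-of-envelope version of the cost model in the text:
#   expected_cost ≈ cost_per_iteration × expected_iterations
# Assumption (ours, not Keles's): each iteration passes the harness
# independently with probability p, so expected iterations = 1 / p.

def expected_cost(cost_per_iteration: float, p_success: float) -> float:
    expected_iterations = 1.0 / p_success
    return cost_per_iteration * expected_iterations

# As models improve (p rises) and token prices fall (cost drops),
# previously uneconomic projects become viable:
print(expected_cost(2.00, 0.05))  # weak model, pricey tokens: 40.0
print(expected_cost(0.50, 0.20))  # better model, cheaper tokens: 2.5
```

The illustrative numbers are invented, but the shape of the argument holds: both factors are improving at once, so expected cost falls multiplicatively, which is why Keles expects demos to keep getting better without any qualitative leap in reasoning.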
There's also a useful distinction between translating applications and translating libraries. Applications deliver more direct business value. Libraries have compounding downstream effects across every codebase that depends on them, but also carry greater risk when something breaks quietly.
The most interesting section of Keles's post looks past translation entirely. The next frontier, he argues, is LLM-driven code optimization — where the objective shifts from semantic equivalence to raw performance improvement. Early signals include tools like BitsEvolve, ShinkaEvolve, Algotune, ADRS, and Glia. Optimization is harder than translation: there's no single right answer, just a performance landscape to explore. But the underlying architecture — harness plus model, iterate until good — is the same. The difference is the reward function.
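To make the "same architecture, different reward function" point concrete, here is the earlier loop with the pass/fail oracle swapped for a performance score. This is a toy hill-climb over three hand-written candidates, not how BitsEvolve, ShinkaEvolve, or the other cited tools actually work.

```python
import timeit

# Candidate implementations the "model" might propose for summing squares
# below n. All three are semantically equivalent; they differ only in speed.
CANDIDATES = {
    "loop":    "total = 0\nfor i in range(n):\n    total += i * i",
    "genexpr": "total = sum(i * i for i in range(n))",
    "closed":  "total = (n - 1) * n * (2 * n - 1) // 6",
}

def reward(body: str, n: int = 10_000) -> float:
    """Benchmark harness: higher reward means faster code (negated runtime)."""
    return -timeit.timeit(body, globals={"n": n}, number=20)

def optimize() -> str:
    """Same loop shape as translation, but the judge is a score, not a test."""
    best_name, best_score = None, float("-inf")
    for name, body in CANDIDATES.items():   # stand-in for model proposals
        score = reward(body)
        if score > best_score:              # keep the best performer so far
            best_name, best_score = name, score
    return best_name

print(optimize())  # the closed-form version should win
```

The contrast with translation shows up in the stopping condition: a test harness gives a binary accept, so the loop can halt at the first pass, whereas a reward function defines a landscape with no natural terminus — you stop when the budget runs out, not when you're "done."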