Coding agents need a harness to sit between the model and the execution environment. Building those harnesses by hand is slow and inconsistent. A team at Peking University led by Jiahang Lin tried something different: let the harness evolve itself.

Their system, Agentic Harness Engineering (AHE), runs a closed loop built on three observability mechanisms. Component observability gives each editable part a file-level representation so changes are explicit and revertible. Experience observability compresses millions of raw trajectory tokens into a corpus the evolving agent can actually consume. Decision observability pairs every edit with a prediction that gets verified against results in the next round. Every change becomes a falsifiable contract instead of a random guess.
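The decision-observability loop can be sketched in a few lines. This is a minimal illustration of the edit-prediction-verification contract, not the paper's implementation: the class names, the component path, and the confirmation threshold are all our assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class HarnessEdit:
    """One file-level change to a harness component (component observability)."""
    component: str          # e.g. "tools/search.py" -- hypothetical path
    patch: str              # kept explicit so the change is revertible
    prediction: str         # falsifiable claim, e.g. "pass@1 rises"
    predicted_delta: float  # expected change in pass@1, in points

@dataclass
class DecisionLog:
    """Decision observability: pair every edit with a prediction,
    then verify it against the next round's measured result."""
    pending: list = field(default_factory=list)

    def propose(self, edit: HarnessEdit, baseline: float) -> None:
        self.pending.append((edit, baseline))

    def verify(self, measured: float) -> list:
        """Return edits whose predictions failed; the caller reverts them."""
        failed = []
        for edit, baseline in self.pending:
            delta = measured - baseline
            # A prediction counts as confirmed if the measured delta has the
            # same sign and at least half the predicted magnitude -- this
            # threshold is our assumption, not the paper's.
            ok = delta * edit.predicted_delta > 0 and \
                 abs(delta) >= abs(edit.predicted_delta) / 2
            if not ok:
                failed.append(edit)
        self.pending.clear()
        return failed
```

Each iteration, the evolving agent proposes edits with predictions, runs the benchmark, and keeps only the edits whose contracts held, which is what makes every change falsifiable rather than a guess.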

This matters because harness engineering has been a quiet bottleneck in the agent world. The industry obsesses over model capabilities, but in practice the harness around the model often determines whether an agent succeeds. Self-evolving harnesses could change that, letting agents improve their own tooling without human engineers in the loop.

After ten iterations on Terminal-Bench 2, AHE raised pass@1 from 69.7% to 77.0%. The human-designed Codex-CLI harness peaked at 71.9%, and the self-evolving baselines ACE and TF-GRPO fell short as well. The evolved harness also transferred, without any retraining, to SWE-bench-verified, hitting top aggregate success with 12% fewer tokens. On Terminal-Bench 2, cross-family gains ranged from 5.1 to 10.1 percentage points across three other model families.

The ablation results tell the real story. Improvements came from structural changes to tools, middleware, and long-term memory. System prompt tweaks contributed almost nothing. AHE encodes actual engineering knowledge into the harness architecture rather than finding clever ways to phrase instructions. The machinery improves. The words stay the same.