Getting a coding agent to patch a bug is one thing. Keeping a codebase functional across six months of continuous development is something else entirely — and until now, no benchmark has seriously tried to measure the difference.
SWE-CI, developed by researchers at Sun Yat-sen University and Alibaba Group and posted to arXiv in early March 2026, fills that gap. The benchmark consists of 100 tasks drawn from real-world repositories, each tied to an authentic development history spanning an average of 233 days and 71 consecutive commits. Agents aren't presented with a frozen snapshot of a repo and asked to fix one thing. They have to work through the code as it actually evolved — new features, refactors, dependency changes, the lot.
That's a direct challenge to SWE-bench and similar evaluations, which treat software engineering as a series of isolated one-shot problems. The researchers cite Lehman's Laws of software evolution, which hold that software quality declines over time unless it is actively adapted and maintained, and point out that maintenance already accounts for somewhere between 60 and 80 percent of total software lifecycle costs. A patch that works today might rot quietly for months before it causes a real failure — a one-shot benchmark will never catch that.
To measure something more meaningful, the team introduces two new metrics. Normalized Change scores how closely an agent's modifications match the actual commits made by human developers at each stage. EvoScore looks at code quality sustainability across the full timeline — the idea being that a clean, extensible fix should hold up better than a hard-coded workaround as the codebase keeps moving. Evaluation runs on a dual-agent protocol pairing an Architect with a Programmer, designed to reflect how real agentic development workflows tend to be structured.
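To make the idea concrete, here is a minimal sketch of what a commit-alignment score *could* look like. This is purely illustrative — the paper defines Normalized Change formally, and the `normalized_change` function below is a hypothetical stand-in that simply measures line-level overlap between an agent's patch and the human developers' commit at the same point in the timeline:

```python
def normalized_change(agent_patch: str, human_commit: str) -> float:
    """Hypothetical alignment score between two unified-diff bodies.

    Illustrative only: compares the sets of added/removed lines,
    normalized by the larger change, so identical edits score 1.0
    and completely disjoint edits score 0.0. SWE-CI's real metric
    is defined in the paper and may differ substantially.
    """
    def changed_lines(diff: str) -> set[str]:
        # Keep only the content of +/- hunk lines, skipping the
        # "---"/"+++" file headers of the unified diff format.
        return {
            line[1:].strip()
            for line in diff.splitlines()
            if line[:1] in "+-" and not line.startswith(("+++", "---"))
        }

    agent, human = changed_lines(agent_patch), changed_lines(human_commit)
    if not agent and not human:
        return 1.0  # both patches empty: trivially identical
    return len(agent & human) / max(len(agent), len(human))


# Example: the agent reproduces one of the two human edits,
# so the overlap score lands between 0 and 1.
human = "--- a/f.py\n+++ b/f.py\n-x = 1\n+x = 2\n+y = 3\n"
agent = "--- a/f.py\n+++ b/f.py\n-x = 1\n+x = 2\n"
print(round(normalized_change(agent, human), 2))  # → 0.67
```

Even this toy version captures the benchmark's key shift: the agent is graded against where the real codebase actually went next, not just against a passing test suite.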
Early results are instructive without being surprising: LLM code maintenance capabilities are improving, different providers weight maintainability differently, and all of the tested models struggle with regression control over long horizons. None of them have cracked the problem of writing code that stays good rather than just working.
The dataset and evaluation code are publicly available on Hugging Face (skylenage/SWE-CI) and GitHub (SKYLENAGE-AI/SWE-CI). The authors are Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, and Bing Zhao.