Security AI agents have a cheating problem. They can read an alert and parrot back the right answer without actually investigating anything, and without providing the forensic details that would prove they did. A new benchmark called SIR-Bench, published this week by researchers from UC Berkeley and Junction AI, aims to fix that.

SIR-Bench contains 794 test cases built from 129 anonymized, expert-validated incident patterns. The team built a framework called Once Upon A Threat (OUAT) that replays real attack patterns in controlled cloud environments, generating realistic attack data.

The evaluation measures whether agents actually find new evidence through active investigation rather than just executing automated runbooks, scoring three metrics: triage accuracy, novel finding discovery, and tool-usage appropriateness. An adversarial "LLM-as-Judge" system inverts the burden of proof: agents have to show concrete forensic evidence to get credit.
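The inverted burden of proof can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual judge: the `Finding` class, the `judge_findings` function, and the scoring categories are invented names, and the real system uses an LLM rather than a rule. The core idea shown here is that a finding with no cited evidence scores zero, even if its claim happens to be correct.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Finding:
    """One claim an agent makes about an incident (hypothetical schema)."""
    claim: str                                          # e.g. "credential theft via IMDS"
    evidence: List[str] = field(default_factory=list)   # forensic artifacts cited in support


def judge_findings(findings: List[Finding], known_claims: Set[str]) -> dict:
    """Adversarial scoring sketch: every claim is presumed unproven.

    A finding only earns credit if it cites at least one concrete
    artifact; an evidence-free claim is discarded even when correct.
    """
    credited = [f for f in findings if f.evidence]
    return {
        "credited_true": sum(1 for f in credited if f.claim in known_claims),
        "novel": sum(1 for f in credited if f.claim not in known_claims),
        "unproven": len(findings) - len(credited),
    }
```

Under this rule, a lucky guess with no supporting logs lands in `unproven` rather than inflating the agent's score:

```python
findings = [
    Finding("credential theft via IMDS", ["cloudtrail: GetCallerIdentity burst"]),
    Finding("lateral movement to prod", []),  # correct guess, but no evidence cited
]
judge_findings(findings, {"credential theft via IMDS", "lateral movement to prod"})
# {'credited_true': 1, 'novel': 0, 'unproven': 1}
```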

The paper comes from Daniel Begimher, Cristian Leo, Jack Huang, Pat Gaw, and Bonan Zheng. Their dual affiliation with UC Berkeley and Junction AI puts SIR-Bench at a useful intersection: grounded in real incident data, but built to serve as a rigorous academic benchmark. The researchers' own agent hit 97.1% true positive detection and 73.4% false positive rejection, discovering an average of 5.67 novel findings per case.

That's the baseline future agents will need to beat.