The expense report has been a vehicle for corporate fraud for as long as it has existed. What Brex published this week is an account of how hard it turns out to be to catch someone who knows what they're doing, and of the testing framework the company built to measure that difficulty at scale.

Brex's AI audit agent reviews every corporate expense on its platform in real time, an LLM making consequential financial judgments continuously. In a post authored by engineering lead Rohit Mehta, the company lays out why standard software testing breaks down under those conditions. The problems are structural: the same expense can produce different reasoning chains across runs; historical fraud data is sparse and skewed toward patterns that were already caught; a detection gap can quietly cost customers tens of thousands of dollars before it registers as something worth fixing.

The solution is a synthetic company purpose-built to surface failures. Every expense archetype — team lunches, hotel stays, software subscriptions — is defined with precise valid ranges for amount, timing, documentation, merchant, and budget code. Fraudulent expenses are generated by applying labeled violations Brex calls mutations: an amount inflated past policy limits, a charge on a restricted day, a budget code that doesn't match the purchase. Because each mutation carries an unambiguous ground-truth label, the team can run thousands of scenarios and measure precision and recall across violation types, expense archetypes, and company profiles. It produces a statistical picture of exactly where the agent is sharp and where it falls short — something unit tests have never been able to provide for a system like this.
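The mechanics described above can be sketched in a few dozen lines. Everything here is illustrative: the archetype definitions, mutation names, and scoring helper are assumptions about how such a framework might look, not Brex's actual code.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Expense:
    archetype: str
    amount: float
    budget_code: str
    day: str
    violations: list = field(default_factory=list)  # ground-truth labels

# Hypothetical archetypes: (valid amount range, expected budget code, allowed days)
ARCHETYPES = {
    "team_lunch": ((15.0, 60.0), "MEALS", ["Mon", "Tue", "Wed", "Thu", "Fri"]),
    "software_subscription": ((10.0, 500.0), "SAAS", ["Mon", "Tue", "Wed", "Thu", "Fri"]),
}

def valid_expense(rng, archetype):
    (lo, hi), code, days = ARCHETYPES[archetype]
    return Expense(archetype, round(rng.uniform(lo, hi), 2), code, rng.choice(days))

# Each mutation injects exactly one labeled violation into a valid expense.
def inflate_amount(rng, e):
    hi = ARCHETYPES[e.archetype][0][1]
    e.amount = round(hi * rng.uniform(1.5, 3.0), 2)  # push past the policy limit
    e.violations.append("amount_over_limit")

def wrong_budget_code(rng, e):
    e.budget_code = "TRAVEL"  # deliberately mismatched with the archetype
    e.violations.append("budget_code_mismatch")

MUTATIONS = [inflate_amount, wrong_budget_code]

def generate(seed, n, fraud_rate=0.3):
    rng = random.Random(seed)  # same seed, same scenarios, every run
    batch = []
    for _ in range(n):
        e = valid_expense(rng, rng.choice(list(ARCHETYPES)))
        if rng.random() < fraud_rate:
            rng.choice(MUTATIONS)(rng, e)
        batch.append(e)
    return batch

def score(expenses, flagged):
    # Precision/recall against the ground-truth violation labels.
    tp = sum(1 for e, f in zip(expenses, flagged) if f and e.violations)
    fp = sum(1 for e, f in zip(expenses, flagged) if f and not e.violations)
    fn = sum(1 for e, f in zip(expenses, flagged) if not f and e.violations)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Because every mutated expense carries its label from the moment it is generated, the agent's verdicts can be scored mechanically, and the same `score` can be sliced by violation type or archetype to build the statistical picture the post describes.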

The adversarial dimension of the framework pushes further. Brex models social-engineering attempts — fake internal memos citing nonexistent policy sections, fabricated C-suite authorization claims — and builds in correlated sloppiness: a fraudulent expense has a 25% chance of also missing its receipt, because real fraudsters are inconsistent in recognizably human ways rather than just in the specific way you happened to be testing for. A deterministic seed makes the whole exercise reproducible: months of synthetic spending history can be generated, and regenerated exactly, in hours. Results are graded not only on catch rate but on audit quality: citation accuracy, tone calibration, the completeness of the agent's reasoning.
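The correlated-sloppiness idea reduces to a conditional probability applied on top of the primary violation. A minimal sketch, with hypothetical names and the 25% figure taken from the post:

```python
import random

RECEIPT_MISS_GIVEN_FRAUD = 0.25  # from the post: fraud correlates with sloppiness

def apply_correlated_sloppiness(rng, expense):
    """If the expense is already fraudulent, it may also lose its receipt.

    `expense` is a dict with a 'violations' list of ground-truth labels;
    the secondary violation is labeled too, so scoring stays mechanical.
    """
    if expense["violations"] and rng.random() < RECEIPT_MISS_GIVEN_FRAUD:
        expense["has_receipt"] = False
        expense["violations"].append("missing_receipt")
    return expense

def missing_receipt_rate(seed, n_fraud=10_000):
    rng = random.Random(seed)  # deterministic seed -> identical history every run
    missing = 0
    for _ in range(n_fraud):
        e = {"violations": ["amount_over_limit"], "has_receipt": True}
        apply_correlated_sloppiness(rng, e)
        missing += not e["has_receipt"]
    return missing / n_fraud  # converges toward 0.25 over a large batch
```

The point of the conditional structure is that sloppiness attaches to fraud rather than being sprinkled independently, which is what makes the synthetic population resemble human adversaries instead of uniform noise.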

The simulation layer is sophisticated. What makes it structurally significant is how it connects to the day-to-day development workflow. A scheduling agent runs simulations continuously and files failures as detailed Linear tickets. A ticket-to-PR agent proposes fixes for engineer review. A CI agent sits in the pull request pipeline, analyzes proposed changes, designs targeted test scenarios, and posts regression reports before anything merges. Mehta frames simulation not as a QA checkpoint but as a constraint built into how the team ships — closer to type-checking than to an audit.
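The "closer to type-checking than to an audit" framing amounts to a hard gate in the merge pipeline. A minimal sketch of what such a gate might check, assuming the CI agent can run the seeded scenario suite and compare against a stored baseline (the function names and thresholds here are illustrative, not Brex's):

```python
def regression_gate(evaluate, baseline_recall, min_precision=0.95):
    """Run the seeded simulation suite against a candidate change.

    `evaluate` runs the fixed scenario suite and returns (precision, recall).
    Returns (passed, message); a False result blocks the merge.
    """
    precision, recall = evaluate()
    if precision < min_precision:
        return False, f"precision {precision:.3f} below floor {min_precision:.3f}"
    if recall < baseline_recall:
        return False, f"recall {recall:.3f} regressed from baseline {baseline_recall:.3f}"
    return True, "no regression detected"
```

Because the scenario suite is seeded and deterministic, a recall drop in this gate points at the proposed change rather than at noise, which is what lets the check sit in the pull-request path like a compiler error.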

There is an honest caveat embedded in what the post doesn't claim. Synthetic data, however carefully designed, models fraud rather than capturing it. Real adversaries adapt to detection systems in ways that no mutation catalog can fully anticipate, and a framework optimized against known violation types will always have a blind side. What Brex has published is a rigorous lower bound on agent reliability, not a proof of correctness.

That narrower claim is still more than most of the industry can make. Consequential agentic systems — the kind that move money, influence medical decisions, or touch legal outcomes — are largely still evaluated on intuition and spot-checks. Brex's willingness to publish the full architecture, including its limits, is the kind of concrete contribution that tends to move a field forward faster than any number of conference talks about responsible AI.