CorpusBench
Measuring whether AI agents can learn and apply company policies from a messy corpus — without ever being told what they are.
CorpusBench simulates a more realistic enterprise deployment scenario than many existing benchmarks. Customer service benchmarks like Taubench provide excellent signal, but assume the agent has access to a detailed policy guide. In practice, very few enterprises have the structured policy guides or SOPs that this requires.
As AI becomes more capable, agents should be able to act as ‘drop-in digital workers’. Such a digital worker would search the existing enterprise corpus to infer the right policies to follow, at least for use cases like customer service where historical outcomes are well documented.
The checked-in release currently contains 186 tasks, backed by 2,281 orders, 562 historical email threads, and 67 products. Each task must be solved without any policy guidance being provided, and the agent must deliver both the correct outcome and a rationale that shows it was correct for the right reason.
Leaderboard
Example Tasks
Each task presents a real customer email. The agent must search a messy corpus of historical emails, orders, and products to determine the correct action — with no policy manual to consult.
A customer wants to return a dress purchased 12 days ago to their original card.
Approve the return and refund the original payment method.
Even in this baseline task, the agent must independently search the email corpus to discover the 14-day return window. It must also verify the exact delivery date from order records rather than trusting the customer's timeline before executing the refund.
A customer asks to return a single $45 cardigan from a larger order that included a promotional free gift bow.
Process a partial refund of $30.
The customer kept the rest of the order, but returning the cardigan drops their total spend below the 'free gift' threshold. The agent must discover the active promotion policy in the corpus, realize it applies, and accurately deduct the $15 gift value from the expected $45 refund.
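The refund adjustment above can be sketched as a small function. This is an illustrative sketch, not the benchmark's actual grading code: the free-gift threshold and original order total are not stated in the task, so the numbers below are hypothetical; only the $45 item price, $15 gift value, and $30 expected refund come from the example.

```python
def adjusted_refund(item_price, order_total, gift_threshold, gift_value):
    """Refund for a partial return, deducting a promotional gift's value
    when the return drops the order below the free-gift threshold."""
    remaining_spend = order_total - item_price
    refund = item_price
    # The order originally qualified for the gift, but no longer does:
    # the customer keeps the gift, so its value comes out of the refund.
    if order_total >= gift_threshold and remaining_spend < gift_threshold:
        refund -= gift_value
    return refund

# Hypothetical $120 order against a hypothetical $100 gift threshold:
print(adjusted_refund(item_price=45, order_total=120,
                      gift_threshold=100, gift_value=15))  # → 30
```

An agent that misses the promotion in the corpus would refund the full $45; the scorer treats that as the wrong outcome even though the return itself was correctly approved.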
A customer sends a single email requesting a size exchange for one order, a return for an older order, and a shipping update for a third.
Approve the exchange, deny the return, and provide the tracking update.
The agent must manage three distinct workflows simultaneously without mixing up order contexts. It must correctly calculate the size exchange, fetch live tracking data, and, crucially, refuse the older return because it falls outside the 30-day window discovered in the policy corpus.
Scoring
CorpusBench scores more than whether a model happened to land on the right answer. The current scorer (v4) is deterministic: outcome, evidence-read, and process checks must all pass, and each run lands in a clear bucket between pass levels. It also tracks whether the workflow stayed operationally sound.
Outcome
We check whether the run reached the correct business resolution for the task, such as the right refund, denial, exchange, escalation, or customer clarification.
Support
Under the current v4 scorer, evidence-read and support checks are deterministic. A run that gets the outcome right can still lose credit if required reads are missing or claims are not evidence-backed.
Clean Process
We also track whether the run stayed clean operationally, avoiding things like fabricated actions, unsupported claims, unsafe disclosures, or unnecessary escalation.
The headline leaderboard metric is a difficulty-weighted average of per-task v4 scores: wrong-outcome runs start at 0, and additional deterministic checks determine the exact bucket. We also show outcome rate and clean pass so you can separate “got the answer” from “got the answer cleanly.”
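The aggregation described above can be sketched as follows. This is a hedged illustration only: the actual v4 bucket values, check definitions, and difficulty weights are not published here, so the 0.5/0.25/0.25 split and the field names are assumptions chosen to mirror the three check categories (outcome, support, clean process).

```python
def headline_score(runs):
    """Difficulty-weighted average of per-task scores.

    Each run is a dict like:
      {"difficulty": 2.0, "outcome": True, "evidence": True, "process": False}
    Bucket values are hypothetical, not the real v4 constants.
    """
    def task_score(r):
        if not r["outcome"]:
            return 0.0          # wrong outcome starts (and stays) at 0
        score = 0.5             # assumed base bucket for the correct outcome
        if r["evidence"]:
            score += 0.25       # required reads present, claims evidence-backed
        if r["process"]:
            score += 0.25       # no fabricated actions, unsafe disclosures, etc.
        return score

    total_weight = sum(r["difficulty"] for r in runs)
    return sum(r["difficulty"] * task_score(r) for r in runs) / total_weight
```

Keeping the checks deterministic means two runs with identical transcripts always land in the same bucket, which makes leaderboard deltas attributable to the model rather than to grader noise.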