CorpusBench
Measuring whether AI agents can learn and apply company policies from a messy corpus — without ever being told what they are.
CorpusBench simulates a more realistic enterprise deployment scenario than many existing benchmarks. Customer service benchmarks like Taubench provide excellent signal, but they assume the agent has access to a detailed policy guide. In practice, very few enterprises maintain the structured policy guides or SOPs this requires.
As AI becomes more capable, an agent should be able to act as a ‘drop-in digital worker’: one that searches the existing enterprise corpus to infer the right policies to follow, at least for use cases like customer service where historical outcomes are well documented.
CorpusBench contains 100 tasks, built on a corpus of ~2,000 previous orders, ~500 emails, and ~70 products. Each task must be solved without any policy guidance, and the agent must deliver both the correct outcome and a rationale demonstrating it reached that outcome for the right reasons.
Leaderboard
Example Tasks
Each task presents a real customer email. The agent must search a messy corpus of historical emails, orders, and products to determine the correct action — with no policy manual to consult.
A customer wants to return a dress purchased 12 days ago to their original card.
Approve the return and refund the original payment method.
Despite being a baseline task, the agent must independently search the email corpus to discover the 14-day return window. It must also verify the exact delivery date from order records rather than trusting the customer's timeline before safely executing the refund.
A customer asks to return a single $45 cardigan from a larger order that included a promotional free gift bow.
Process a partial refund of $30.
The customer kept the rest of the order, but returning the cardigan drops their total spend below the ‘free gift’ threshold. The agent must discover the active promotion policy in the corpus, recognize that it applies, and accurately deduct the $15 gift value from the expected $45 refund.
A customer sends a single email requesting a size exchange for one order, a return for an older order, and a shipping update for a third.
Approve the exchange, deny the return, and provide the tracking update.
The agent must manage three distinct workflows simultaneously without mixing up order contexts. It must correctly calculate the size exchange, fetch live tracking data, and crucially, refuse the older return because it falls outside the 30-day window discovered in the policy corpus.
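The gift-threshold arithmetic from the cardigan example can be sketched in a few lines. This is a minimal illustration, not the benchmark's grader: the function name and the $75 promotion threshold are hypothetical, while the $45 item price and $15 gift value come from the task description above.

```python
GIFT_VALUE = 15.00      # value of the promotional free gift (from the task)
GIFT_THRESHOLD = 75.00  # spend required to keep the gift (hypothetical figure)

def refund_amount(item_price: float, order_total: float) -> float:
    """Refund for returning one item, deducting the gift's value
    if the remaining spend falls below the promotion threshold."""
    remaining_spend = order_total - item_price
    refund = item_price
    if remaining_spend < GIFT_THRESHOLD:
        refund -= GIFT_VALUE  # customer keeps the gift they no longer qualify for
    return refund

# Returning the $45 cardigan from a hypothetical $100 order leaves $55 of
# spend, below the threshold, so the refund is $45 - $15 = $30.
print(refund_amount(45.00, 100.00))  # → 30.0
```

An agent that skips the corpus search and refunds the sticker price would answer $45, which is exactly the failure mode this task is designed to expose.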
Scoring
CorpusBench scores more than whether a model happened to land on the right answer. The benchmark separates the business outcome, the evidence the agent actually accessed, and whether the workflow stayed operationally sound.
Outcome
We check whether the run reached the correct business resolution for the task, such as the right refund, denial, exchange, escalation, or customer clarification.
Evidence
A correct answer only counts as full task success when the agent opened enough supporting evidence from the corpus to justify that answer.
Clean Process
We also track whether the run stayed clean operationally, avoiding things like fabricated actions, unsupported claims, unsafe disclosures, or unnecessary escalation.
The headline leaderboard metric is a difficulty-weighted task-success average. We also show simpler supporting rates, such as outcome-only and clean pass, so you can tell the difference between “got the answer” and “got the answer for the right reason.”
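A difficulty-weighted average of this kind can be sketched as follows. The weighting scheme shown (each task's weight equals its difficulty level) is an assumption for illustration; CorpusBench's exact weights are not specified here.

```python
def weighted_task_success(results):
    """Compute a difficulty-weighted pass rate in [0, 1].

    results: list of (difficulty_weight, passed) pairs, one per task.
    Weights are assumed positive; harder tasks count for more.
    """
    total_weight = sum(w for w, _ in results)
    earned = sum(w for w, passed in results if passed)
    return earned / total_weight if total_weight else 0.0

# Hypothetical run: easy pass (weight 1), medium pass (2), hard fail (3).
runs = [(1, True), (2, True), (3, False)]
print(weighted_task_success(runs))  # 3/6 = 0.5
```

Under this scheme, missing a single hard task costs more than missing an easy one, which is the point of weighting by difficulty rather than reporting a flat pass rate.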