CorpusBench

Measuring whether AI agents can learn and apply company policies from a messy corpus — without ever being told what they are.

CorpusBench simulates a more realistic enterprise deployment scenario than most existing benchmarks. Customer service benchmarks like Taubench provide excellent signal, but they assume the agent has access to a detailed policy guide. In practice, few enterprises maintain the structured policy guides or SOPs this requires.

As AI becomes more capable, an agent should be able to act as a ‘drop-in digital worker’: one that searches the existing enterprise corpus to infer the right policies to follow, at least for use cases like customer service where historical outcomes are well documented.

There are 100 tasks in CorpusBench, drawing on a corpus of ~2,000 previous orders, ~500 emails, and ~70 products. Each task must be solved without any policy guidance, and the agent must deliver both the correct outcome and a rationale showing it reached that outcome for the right reason.

Leaderboard


Example Tasks

Each task presents a real customer email. The agent must search a messy corpus of historical emails, orders, and products to determine the correct action — with no policy manual to consult.

Baseline
The Request

A customer wants to return a dress purchased 12 days ago to their original card.

Expected Output

Approve the return and refund the original payment method.

Failure Mode

Even in this baseline task, the agent must independently search the email corpus to discover the 14-day return window. It must also verify the exact delivery date from order records rather than trusting the customer's timeline before executing the refund.
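The underlying check is simple date arithmetic; what makes the task non-trivial is that the 14-day window must be inferred from the corpus and the delivery date taken from order records. A minimal sketch, where the function name and the dates are illustrative rather than part of any benchmark harness:

```python
from datetime import date, timedelta

RETURN_WINDOW_DAYS = 14  # inferred from historical emails, not given up front

def return_is_eligible(delivered_on: date, request_date: date,
                       window_days: int = RETURN_WINDOW_DAYS) -> bool:
    """Check the return window against the verified delivery date,
    not the date the customer claims."""
    return request_date <= delivered_on + timedelta(days=window_days)

# The customer says "12 days ago"; the order record is the source of truth.
delivered = date(2024, 3, 1)
requested = date(2024, 3, 13)   # 12 days after delivery → inside the window
print(return_is_eligible(delivered, requested))  # True
```

The point of the sketch is the order of operations: verify the delivery date first, then apply the inferred window, and only then act on the refund.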

Adversarial
The Request

A customer asks to return a single $45 cardigan from a larger order that included a promotional free gift bow.

Expected Output

Process a partial refund of $30.

Failure Mode

The customer kept the rest of the order, but returning the cardigan drops their total spend below the 'free gift' threshold. The agent must discover the active promotion policy in the corpus, realize it applies, and accurately deduct the $15 gift value from the expected $45 refund.
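The refund arithmetic the agent must reconstruct can be sketched as below. The $60 gift threshold and $70 original order total are hypothetical values chosen so that returning the $45 cardigan drops the kept items below the promotion threshold; only the $15 gift value and $30 expected refund come from the task itself:

```python
GIFT_THRESHOLD = 60.00  # hypothetical promo threshold the agent must discover
GIFT_VALUE = 15.00      # value of the promotional free gift

def partial_refund(order_total: float, item_price: float,
                   gift_applied: bool) -> float:
    """Refund for one returned item, clawing back the gift value
    if the remaining spend no longer qualifies for the promotion."""
    remaining_spend = order_total - item_price
    refund = item_price
    if gift_applied and remaining_spend < GIFT_THRESHOLD:
        refund -= GIFT_VALUE
    return refund

# Returning the $45 cardigan from a hypothetical $70 order leaves $25 kept,
# below the threshold, so the gift value is deducted from the refund.
print(partial_refund(70.00, 45.00, gift_applied=True))  # 30.0
```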

Adversarial
The Request

A customer sends a single email requesting a size exchange for one order, a return for an older order, and a shipping update for a third.

Expected Output

Approve the exchange, deny the return, and provide the tracking update.

Failure Mode

The agent must manage three distinct workflows simultaneously without mixing up order contexts. It must correctly process the size exchange, fetch live tracking data, and, crucially, refuse the older return because it falls outside the 30-day window discovered in the policy corpus.

Scoring

CorpusBench scores more than whether a model happened to land on the right answer. The benchmark separates the business outcome, the evidence the agent actually accessed, and whether the workflow stayed operationally sound.

Outcome

We check whether the run reached the correct business resolution for the task, such as the right refund, denial, exchange, escalation, or customer clarification.

Evidence

A correct answer only counts as full task success when the agent opened enough supporting evidence from the corpus to justify that answer.

Clean Process

We also track whether the run stayed clean operationally, avoiding things like fabricated actions, unsupported claims, unsafe disclosures, or unnecessary escalation.

The headline leaderboard metric is a difficulty-weighted task-success average. We also show simpler supporting rates like outcome-only and clean pass so you can tell the difference between “got the answer” and “got the answer for the right reason.”
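The exact weighting scheme is not specified above, but a difficulty-weighted task-success average can be sketched as follows; the per-task weights and the sample results are illustrative:

```python
def weighted_task_success(results: list[tuple[bool, float]]) -> float:
    """Difficulty-weighted success rate.

    results: (passed, weight) pairs, one per task; harder tasks
    carry larger weights, so failing them costs more.
    """
    total_weight = sum(weight for _, weight in results)
    passed_weight = sum(weight for passed, weight in results if passed)
    return passed_weight / total_weight

# Two easier passes outweighed by one hard failure (weights are hypothetical).
runs = [(True, 1.0), (True, 2.0), (False, 3.0)]
print(weighted_task_success(runs))  # 0.5
```

An unweighted outcome-only rate over the same runs would be 2/3, which is why the headline metric and the supporting rates can diverge.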