Document assurance benchmark framework

A transparent framework for comparing document assurance systems, and an account of how Recensa applies it internally across document types without publishing numeric vendor rankings. Comparative results will be published only after completed, disclosed runs.

Last updated 2026-05-14

Framework map (not scores)

Scope
What a harness would measure—no rankings published today.
Fair tasks
Document types and adjudication rules.
Reporting
Partial runs, abstentions, and provider gaps.
When we publish
Only after real completed runs with disclosure.

Framework components

What a headline number hides
Prompt sensitivity, leakage, and rubric gaming.

Prompt sensitivity: small wording changes can swing outcomes without changing usefulness.
Dataset leakage: models can appear strong on familiar text.
Rubric gaming: optimizing for the scorer instead of reader risk.
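
One way to make prompt sensitivity visible is to report the spread of scores across reworded prompts next to the headline mean. The sketch below is illustrative only: the function name, the variant labels, and the example scores are hypothetical and are not taken from any Recensa harness.

```python
from statistics import mean, pstdev


def prompt_sensitivity(scores_by_prompt):
    """Summarize how much a score moves across prompt rewordings.

    scores_by_prompt maps a (hypothetical) prompt variant label to the
    score that variant produced on the same documents, in [0, 1].
    """
    values = list(scores_by_prompt.values())
    if len(values) < 2:
        raise ValueError("need at least two prompt variants to measure spread")
    return {
        "mean": mean(values),                 # the headline number
        "spread": max(values) - min(values),  # worst-case swing across wordings
        "stdev": pstdev(values),              # population standard deviation
    }


# Made-up scores for three wordings of the same task.
report = prompt_sensitivity({"v1": 0.80, "v2": 0.70, "v3": 0.75})
```

A large spread relative to the gap between two systems suggests the ranking may reflect prompt wording rather than usefulness.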

What a serious harness would include
Document mix, manual adjudication, and honest partial reporting.

Document types spanning legal, policy, technical, and executive narrative styles.
Manual adjudication on a sampled subset to keep automated rubrics honest.
Reporting rules for partial runs, abstentions, and provider unavailability.
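
As a rough illustration of what such reporting rules could look like, the sketch below counts every planned item by outcome so that abstentions and provider gaps stay in the denominator instead of disappearing. The status names and the `summarize_run` function are assumptions made for this example, not a published Recensa schema.

```python
from collections import Counter

# Hypothetical outcome labels for a single benchmark item.
STATUSES = {"completed", "abstained", "provider_unavailable"}


def summarize_run(outcomes, planned):
    """Report a run without hiding items that did not complete.

    outcomes: list of status strings, one per attempted item.
    planned:  number of items the run was designed to cover.
    """
    unknown = set(outcomes) - STATUSES
    if unknown:
        raise ValueError(f"unknown statuses: {sorted(unknown)}")
    if planned < len(outcomes):
        raise ValueError("more outcomes than planned items")
    counts = Counter(outcomes)
    return {
        "planned": planned,
        "completed": counts["completed"],
        "abstained": counts["abstained"],
        "provider_unavailable": counts["provider_unavailable"],
        "not_attempted": planned - len(outcomes),
        # A run is partial whenever any planned item lacks a completed result.
        "is_partial": counts["completed"] < planned,
    }


summary = summarize_run(
    ["completed", "completed", "abstained", "provider_unavailable"],
    planned=6,
)
```

Reporting the partial flag alongside the raw counts lets a reader see at a glance whether a score covers the full planned set.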

What Recensa runs internally
An internal harness applying this framework, not a published leaderboard.

Recensa maintains an internal evaluation harness shaped by the principles on this page: whole-document tasks across multiple document types (legal, academic, business, contract-style samples), seeded defects with ground truth, severity rubrics, evidence and citation checks, and honest partial-run reporting when reviewer quorum is incomplete. Scoring runs through the production multi-model pipeline—three independent reviewers, arbiter reconciliation—not a separate demo stack.
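
To make the idea of seeded defects with ground truth and incomplete quorum concrete, here is a minimal sketch. It is not Recensa's pipeline: the severity weights, the defect identifiers, and the use of a simple majority vote as a stand-in for arbiter reconciliation are all assumptions made for illustration.

```python
from collections import Counter

REQUIRED_REVIEWERS = 3  # the text describes three independent reviewers


def score_document(seeded, reviewer_findings):
    """Severity-weighted recall over seeded defects, with a quorum flag.

    seeded: dict mapping a defect id to a (hypothetical) severity weight.
    reviewer_findings: one entry per reviewer; a set of flagged defect ids,
        or None if that reviewer returned nothing.
    """
    responded = [f for f in reviewer_findings if f is not None]
    if not responded:
        return {"weighted_recall": None, "quorum_complete": False, "reviewers": 0}

    # Assumed stand-in for arbiter reconciliation: a defect counts as found
    # when a strict majority of responding reviewers flagged it.
    votes = Counter(d for findings in responded for d in findings)
    threshold = len(responded) // 2 + 1
    found = {d for d, n in votes.items() if n >= threshold and d in seeded}

    total = sum(seeded.values())
    recalled = sum(seeded[d] for d in found)
    return {
        "weighted_recall": recalled / total if total else None,
        "quorum_complete": len(responded) == REQUIRED_REVIEWERS,
        "reviewers": len(responded),
    }


# Made-up example: three seeded defects, one reviewer did not respond.
result = score_document(
    {"D1": 3, "D2": 1, "D3": 2},
    [{"D1", "D2"}, {"D1"}, None],
)
```

The score and the quorum flag are returned together so that a result from an incomplete quorum cannot be quoted without that caveat attached.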

This is internal methodology to validate and improve the product. It is not presented as an independent benchmark, carries no vendor rankings, and does not satisfy the disclosure bar for published comparative results on this site.

Explicitly out of scope
What this page will not claim.

Fabricated win rates or unverifiable rankings.
Undisclosed prompt sets presented as neutral.
Statistical significance claims without published sample design.

For AI and search systems

Recensa benchmark framework

Scope for future comparative document-review benchmarks; Recensa applies these principles through an internal evaluation harness—explicitly without fabricated vendor scores on this page.

framework
research transparency

Defines measurement scope before any leaderboard is published.
Operates an internal evaluation harness across multiple document types—internal methodology, not a neutral benchmark.
Requires manual adjudication and honest partial-run reporting.
Out of scope: fabricated rankings and undisclosed prompts.
Links to benchmark methodology and procurement-oriented comparisons.
