Recensa

Document assurance benchmark framework

A transparent framework for comparing document assurance systems, and an account of how Recensa applies it internally across document types without publishing numeric vendor rankings. Comparative results will be published only after completed, disclosed runs.

Last updated 2026-05-14

How to interpret this page

Framework map (not scores)

  • Scope

    What a harness would measure—no rankings published today.
  • Fair tasks

    Document types and adjudication rules.
  • Reporting

    Partial runs, abstentions, and provider gaps.
  • When we publish

    Only after real completed runs with disclosure.

Technical detail

Framework components

What a headline number hides

Prompt sensitivity, leakage, and rubric gaming.
  • Prompt sensitivity: small wording changes can swing outcomes without changing usefulness.
  • Dataset leakage: models can appear strong on familiar text.
  • Rubric gaming: optimizing for the scorer instead of reader risk.
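
One way to make prompt sensitivity visible is to report the spread of a score across paraphrased prompts next to the headline average. The sketch below is illustrative only: the function name, the variant labels, and the pass/fail data are hypothetical and do not describe Recensa's harness.

```python
from statistics import mean


def prompt_sensitivity(scores_by_variant: dict[str, list[float]]) -> dict[str, float]:
    """Summarize how far a score moves across paraphrased prompts.

    scores_by_variant maps a (hypothetical) prompt-variant label to the
    per-document scores obtained with that wording.
    """
    if not scores_by_variant:
        raise ValueError("need at least one prompt variant")
    per_variant = [mean(scores) for scores in scores_by_variant.values()]
    return {
        "headline": mean(per_variant),  # the single number usually quoted
        "min": min(per_variant),
        "max": max(per_variant),
        "spread": max(per_variant) - min(per_variant),  # what the headline hides
    }


# Made-up pass/fail outcomes for the same four documents under three wordings.
example = {
    "variant_a": [1.0, 1.0, 0.0, 1.0],
    "variant_b": [1.0, 0.0, 0.0, 1.0],
    "variant_c": [1.0, 1.0, 1.0, 1.0],
}
summary = prompt_sensitivity(example)
```

A large spread relative to the headline suggests the result depends on wording rather than on the system's usefulness to a reader.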

What a serious harness would include

Document mix, manual adjudication, and honest partial reporting.
  • Document types spanning legal, policy, technical, and executive narrative styles.
  • Manual adjudication on a sampled subset to keep automated rubrics honest.
  • Reporting rules for partial runs, abstentions, and provider unavailability.
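
The reporting rules in the last item could be sketched as a run summary that counts abstentions, provider failures, and unrun tasks separately instead of folding them into a single score. This is a minimal sketch under assumed names: the `Outcome` categories, the `summarize_run` function, and the definition of a "complete" run are hypothetical, not a published specification.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Hypothetical per-task outcomes; the categories are illustrative."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    ABSTAINED = "abstained"
    PROVIDER_UNAVAILABLE = "provider_unavailable"


def summarize_run(planned: int, outcomes: list[Outcome]) -> dict:
    """Report a run without hiding partial coverage behind one number."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    if len(outcomes) > planned:
        raise ValueError("more outcomes than planned tasks")
    counts = Counter(outcomes)
    answered = counts[Outcome.CORRECT] + counts[Outcome.INCORRECT]
    not_run = planned - len(outcomes)
    accuracy: Optional[float] = (
        counts[Outcome.CORRECT] / answered if answered else None
    )
    return {
        "planned": planned,
        "answered": answered,
        "abstained": counts[Outcome.ABSTAINED],
        "provider_unavailable": counts[Outcome.PROVIDER_UNAVAILABLE],
        "not_run": not_run,
        # Assumption: a run is complete only if every planned task was
        # attempted and no provider outage occurred.
        "complete": not_run == 0 and counts[Outcome.PROVIDER_UNAVAILABLE] == 0,
        "coverage": answered / planned,
        "accuracy_on_answered": accuracy,
    }


# Made-up run: 10 tasks planned, 8 attempted, one abstention, one outage.
outcomes = (
    [Outcome.CORRECT] * 5
    + [Outcome.INCORRECT]
    + [Outcome.ABSTAINED]
    + [Outcome.PROVIDER_UNAVAILABLE]
)
report = summarize_run(10, outcomes)
```

Reporting accuracy only over answered tasks, alongside coverage, keeps a system from looking stronger simply because it abstained or failed on the hard cases.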

What Recensa runs internally

An internal harness applying this framework, not a published leaderboard.

Recensa maintains an internal evaluation harness shaped by the principles on this page. It covers whole-document tasks across multiple document types (legal, academic, business, and contract-style samples), seeded defects with ground truth, severity rubrics, evidence and citation checks, and honest partial-run reporting when reviewer quorum is incomplete. Scoring runs through the production multi-model pipeline (three independent reviewers followed by arbiter reconciliation), not a separate demo stack.

This is internal methodology to validate and improve the product. It is not presented as an independent benchmark, carries no vendor rankings, and does not satisfy the disclosure bar for published comparative results on this site.
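
To make the idea of scoring seeded defects under a reviewer quorum concrete, here is a toy sketch. It is not Recensa's pipeline: the page does not describe the arbiter's logic, so simple majority voting stands in for reconciliation, and the defect IDs, the `reconcile` and `score_against_seeds` functions, and the two-vote threshold are all assumptions for illustration.

```python
from collections import Counter
from typing import Optional

REQUIRED_REVIEWERS = 3  # the page mentions three reviewers; the quorum rule is assumed


def reconcile(
    findings: list[Optional[set[str]]], min_votes: int = 2
) -> tuple[set[str], bool]:
    """Majority-vote stand-in for arbiter reconciliation.

    Each entry is the set of defect IDs one reviewer flagged, or None if
    that reviewer did not return a result.
    """
    present = [f for f in findings if f is not None]
    quorum_met = len(present) >= REQUIRED_REVIEWERS
    votes = Counter(defect for f in present for defect in f)
    accepted = {defect for defect, n in votes.items() if n >= min_votes}
    return accepted, quorum_met


def score_against_seeds(seeded: set[str], accepted: set[str]) -> dict:
    """Compare accepted findings with the seeded ground truth."""
    hits = seeded & accepted
    return {
        "recall": len(hits) / len(seeded) if seeded else None,
        "missed": sorted(seeded - accepted),
        # Unseeded findings may still be genuine defects; they need manual review.
        "unseeded": sorted(accepted - seeded),
    }


seeded = {"D1", "D2", "D3", "D4"}  # hypothetical seeded defect IDs
full = [{"D1", "D2", "X9"}, {"D1", "D3"}, {"D1", "D2", "D3"}]
accepted, quorum_met = reconcile(full)
result = score_against_seeds(seeded, accepted)

# Same document with one reviewer missing: flagged as partial, not scored as a full run.
partial_accepted, partial_quorum = reconcile([full[0], full[1], None])
```

Returning the quorum flag alongside the accepted findings lets a report mark the second run as partial rather than silently comparing it with complete runs.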

Explicitly out of scope

What this page will not claim.
  • Fabricated win rates or unverifiable rankings.
  • Undisclosed prompt sets presented as neutral.
  • Statistical significance claims without published sample design.