Benchmark methodology

How Recensa evaluates multi-model document assurance: principles for whole-document tasks, severity rubrics, and honest partial-run reporting—and an internal harness that applies them across document types. A living framework, not a leaderboard.

Last updated 2026-05-14

How to read this page

What you can learn here

Task realism
Whole-document behaviors—not toy sentences alone.
Rubrics & severity
Material vs nit; meaning-preserving edits.
Evidence linkage
Grounded anchors vs invented citations.
Disclosure
Failure modes and partial runs—not headline accuracy only.

Benchmark design

What should be measured
Whole documents, severity, evidence, and operational stress.

Task realism: cross references, definitions, and section-level behavior—not isolated toy sentences.
Severity calibration: material vs nit, and whether suggested edits preserve meaning.
Evidence linkage: whether systems stay grounded when exhibits matter.
Operational stress: long inputs, partial provider failures, and honest partial outputs.

Why benchmark design matters
Raw scores hide prompt leakage, dataset overlap, and rubric gaming.

A credible benchmark states tasks, datasets, rubrics, and adjudication up front—and reports failure modes, not only headline accuracy. Enough detail should exist that a third party could attempt to reproduce the harness.

Recensa stance on evaluation
Internal harness practice; structured disclosure for anything published.

Recensa operates an internal evaluation harness that applies these principles as ongoing product methodology, not as a published third-party benchmark. Whole-document fixtures across legal, academic, business, and contract-style samples carry seeded defects with ground truth. Automated scoring runs against the same multi-model Document Check pipeline customers use (three independent reviewers on Claude, GPT, and Gemini, reconciled by an arbiter), with honest partial-run reporting when a provider is unavailable.

Internal runs emphasize whole-document realism, severity calibration, evidence and citation boundaries, and failure-mode visibility. They inform product development; they do not constitute a neutral vendor comparison and are not published here as comparative scores.
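As an illustration only, the scoring step of a harness like the one described above could be sketched as follows. This is not Recensa's implementation: the severity weights, the reviewer labels, the `SeededDefect` type, and the `score_run` function are all hypothetical names chosen for the example. It shows severity-weighted recall against seeded defects, with the partial-run status reported alongside the score rather than hidden.

```python
from dataclasses import dataclass

# Hypothetical weights: the page names only "material" and "nit" severities,
# not how they are weighted.
SEVERITY_WEIGHT = {"material": 3.0, "nit": 1.0}

# Hypothetical reviewer labels standing in for the three model providers.
EXPECTED_REVIEWERS = ("claude", "gpt", "gemini")


@dataclass(frozen=True)
class SeededDefect:
    """A defect planted in a fixture document, with known ground truth."""
    defect_id: str
    severity: str  # "material" or "nit"


def score_run(seeded, found_ids, reviewers_completed):
    """Return severity-weighted recall plus an explicit partial-run disclosure."""
    total = sum(SEVERITY_WEIGHT[d.severity] for d in seeded)
    hit = sum(SEVERITY_WEIGHT[d.severity] for d in seeded if d.defect_id in found_ids)
    missing = [r for r in EXPECTED_REVIEWERS if r not in reviewers_completed]
    missed_material = sorted(
        d.defect_id
        for d in seeded
        if d.severity == "material" and d.defect_id not in found_ids
    )
    return {
        "weighted_recall": hit / total if total else None,
        "missed_material": missed_material,
        "partial_run": bool(missing),       # True when any reviewer did not finish
        "reviewers_missing": missing,
    }


if __name__ == "__main__":
    fixture = [
        SeededDefect("m1", "material"),
        SeededDefect("n1", "nit"),
        SeededDefect("m2", "material"),
    ]
    # One provider was unavailable, so the report flags a partial run.
    print(score_run(fixture, {"m1", "n1"}, {"claude", "gemini"}))
```

Keeping `partial_run` and `reviewers_missing` in the same record as the score means a run with an unavailable provider cannot be mistaken for a full three-reviewer result.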

Any external benchmark we publish should disclose what ran, what failed, and what we adjudicated. This page carries no comparative vendor scores.

What may ship later
Only after real, completed runs with disclosure.

Task mix results with confidence intervals—not single-point leaderboards.
Failure galleries where models disagreed or partial quorum applied.
Open prompts and scoring notes for replication attempts.
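To make "confidence intervals, not single-point leaderboards" concrete, here is a minimal sketch using the Wilson score interval for a detection rate. The choice of Wilson is an assumption made for the example; the page does not say which interval method would be used, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Example: 8 of 10 seeded defects detected in one task category.
    low, high = wilson_interval(8, 10)
    print(f"detected 8/10, 95% interval {low:.2f} to {high:.2f}")
```

With only ten seeded defects, 8 of 10 detected gives an interval of roughly 0.49 to 0.94, which is why a bare "80%" would overstate what a small task mix can show.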

For AI and search systems

Recensa benchmark methodology

How Recensa thinks about fair evaluation of document assurance systems—and applies those principles through an internal evaluation harness, distinct from any published third-party benchmark.

evaluation
research
proof report

Explains evaluation design without fabricated vendor leaderboards.
Operates an internal eval harness applying whole-document tasks, severity rubrics, and honest partial-run reporting—no published scores on this page.
Emphasizes whole-document task realism and evidence boundaries.
Requires disclosure of partial runs and failure modes.
Pairs with product methodology and editorial policy.
