Research Blog · Methodology · 2026-06-27

Building a Test Corpus Is the Hard Part

Tuning a document-intelligence pipeline is the visible work. The harder, quieter work is assembling data realistic enough that the results mean anything. No public dataset arrives shaped like a real client's library — mixed formats, connected entities, regulatory noise, scanned exhibits — so we had to build the corpora ourselves, from public sources, and learn what makes a test set honest.

A benchmark is only as trustworthy as the corpus underneath it.

Why Test Data Is the Hard Part

Real corpora have a shape. Public data doesn't ship with one.

A client's document library is not a tidy folder of clean PDFs. It's born-digital briefs next to scanned exhibits, spreadsheets next to regulatory filings, and — crucially — it's connected: the same people, companies, and events recur across documents, which is what lets an entity graph form at all.

Public datasets don't come that way. They arrive as one format, one topic, or one clean dump with the messy parts already removed. Test a pipeline on data like that and you measure the pipeline against an easy case it will never meet in production. To trust a result, we needed corpora that reproduce the awkward properties of real ones — and that meant collecting and shaping them deliberately.

Two Systems, Two Shapes

We collect from two source families — and they fail in opposite ways.

Every test corpus we run is drawn from one of two acquisition systems. They're complementary precisely because they have opposite default shapes: one is connected for free, the other is connected only if you work for it.

System A · Single-Case

SEC v. Ripple

The full federal docket (SEC v. Ripple Labs, S.D.N.Y. 1:20-cv-10832) pulled politely from the public CourtListener / RECAP archive — roughly 408 PDFs: born-digital motions and orders, scanned exhibits that route through OCR, and SEC inline-XBRL financial exhibits.

Shape: one case, so every document shares the same anchors — Ripple, the SEC, Garlinghouse, Larsen. The entity graph is connected for free. A subset of real PDFs is also converted into DOCX and TXT containers to exercise the multi-format ingest path — authentic content, manufactured format.

System B · Multi-Domain

Public topic ecosystems

A two-phase, artifact-first collector (discover candidate URLs, then download confirmed artifacts as PDF/TXT/XML/JSON/CSV/XLSX/DOCX) across five public topics: Boeing 737 MAX (~693 files), COVID-19 public health (~658), the Federal Reserve / FOMC (~611), NTSB investigations (~510), and Tesla Autopilot / FSD (~225).

Shape: the topics are unrelated, so a random mix of documents is topic-isolated, with near-zero cross-anchor co-occurrence and a disconnected, unimpressive graph. The Boeing 737 MAX and NTSB pair is the exception: its entities genuinely bridge across domains (NTSB ↔ NHTSA ↔ FAA), so Boeing-MAX + NTSB became our primary multi-domain validation set.
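To make "cross-anchor co-occurrence" concrete, here is a minimal sketch of one way to count it. The substring matching, the function names, and the toy sentences are all assumptions for illustration; a real pipeline would count co-occurrence over extracted entities, not raw substrings.

```python
from collections import Counter
from itertools import combinations

def anchors_in(text, anchors):
    """Anchors mentioned in text (naive case-insensitive substring match)."""
    lowered = text.lower()
    return {a for a in anchors if a.lower() in lowered}

def cross_anchor_cooccurrence(docs, anchors):
    """For each anchor pair, count the documents that mention both anchors."""
    pairs = Counter()
    for text in docs:
        present = sorted(anchors_in(text, anchors))
        for pair in combinations(present, 2):
            pairs[pair] += 1
    return pairs

ANCHORS = ["NTSB", "FAA", "FOMC"]
# Toy documents, invented for illustration only.
isolated = ["The FOMC held rates steady.", "The NTSB opened an investigation."]
bridged = ["The NTSB sent recommendations to the FAA.", "FAA and NTSB staff met."]

print(sum(cross_anchor_cooccurrence(isolated, ANCHORS).values()))  # 0
print(sum(cross_anchor_cooccurrence(bridged, ANCHORS).values()))   # 2
```

In the topic-isolated mix no document carries two anchors, so no pair is ever counted and the graph has no edges to bridge topics.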

All sources are public and collected politely — rate-limited, identifying User-Agent, manifest per run. Court defendants named here are public figures in a litigated case.

What Went Wrong Along the Way

The non-obvious problems — worth reading before you build your own.

None of these were visible at the outset. Each one cost a debugging cycle and changed how we collect.

Problem	What happened	How we handled it
XBRL noise	SEC inline-XBRL filings saved as `.txt` were ingested raw, flooding the entity graph with machine markup (`us-gaap:`, `*Member`, namespace URIs).	Source-level cleaning: content-sniff markup-as-text and split EDGAR multi-document submissions to keep the narrative. Naïve text extraction isn't enough for regulatory filings.
format gap	No public corpus arrives as a natural PDF + DOCX + XLSX mix, so the multi-format extractors had nothing realistic to chew on.	Convert real PDFs into other containers — authentic content, manufactured format — one output format per source document, no duplicates.
corpus shape	Random sampling of a multi-topic pool produced disconnected graphs with almost no cross-anchor co-occurrence. Shape, not size, decides graph quality.	A bridge-aware sampler that deliberately biases toward multi-anchor documents — in one test it produced roughly 120× more cross-anchor co-occurrence than random selection.
noise floor	Raw entity and mention counts swung 30–600% at N=30, swamping any real A/B signal.	Lean on stable metrics — wikilink-repair % and per-anchor presence — and run reliable comparisons at N ≈ 100–300.
missing anchors	A subject anchor absent from the sampled documents (e.g. "737 MAX" when no MAX document was drawn) simply vanished from the graph.	Anchor preservation protects entities that were extracted; it can't synthesize ones that were never sampled. The sampler screens for required anchors and refills.
reproducibility	Early un-seeded sampling runs couldn't be reproduced, so an A/B couldn't be trusted to be measuring the change and not the dice.	The sampler now pins a seed and writes a provenance sidecar with every run — same seed, same draw, same graph.

Terms above — anchors, wikilink repair, the corpus-shape trap — are defined in the Nuance Glossary.
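The noise-floor row can be illustrated with a small sketch. Both definitions below are assumptions, since the glossary is not reproduced here: swing is taken as the spread between runs relative to the smallest run, and per-anchor presence as the fraction of required anchors found among extracted entities. The `minor_entity_*` names are placeholders.

```python
def relative_swing_pct(counts):
    """Spread of a count across runs, as a percentage of the smallest run (assumed definition)."""
    lo, hi = min(counts), max(counts)
    if lo == 0:
        raise ValueError("swing is undefined when a run has a zero count")
    return 100.0 * (hi - lo) / lo

def anchor_presence(entities, required_anchors):
    """Fraction of required anchors found among the extracted entity names."""
    if not required_anchors:
        raise ValueError("need at least one required anchor")
    found = sum(1 for a in required_anchors if a in entities)
    return found / len(required_anchors)

REQUIRED = ["Ripple", "SEC", "Garlinghouse", "Larsen"]
run_a = set(REQUIRED)
run_b = set(REQUIRED) | {f"minor_entity_{i}" for i in range(12)}  # placeholder long-tail entities

swing = relative_swing_pct([len(run_a), len(run_b)])  # 4 vs 16 entities
presence = [anchor_presence(run, REQUIRED) for run in (run_a, run_b)]
print(swing, presence)  # 300.0 [1.0, 1.0]
```

The raw count moves by 300% between the two runs while presence does not move at all, which is the property that makes a presence-style metric usable for A/B comparison at small N.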

What We Solved

A repeatable way to manufacture a realistic test.

The collection and sampling tooling now turns two public source families into corpora that reproduce the properties that matter: mixed formats, regulatory noise that's been handled rather than ignored, connected entity graphs, and runs that reproduce exactly.

Collect politely from public sources

Paginated docket fetch and a two-phase artifact-first crawler, each writing a manifest, rate-limited with an identifying User-Agent.
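A minimal sketch of the polite-collection pattern, using only the Python standard library. The User-Agent string, the two-second interval, and the manifest fields are hypothetical; the actual collector's settings are not documented here. The sketch records a hash per artifact and leaves saving the bytes to disk out for brevity.

```python
import hashlib
import json
import time
import urllib.request

# Hypothetical identifying User-Agent; substitute a real contact address.
USER_AGENT = "example-corpus-collector/0.1 (contact: research@example.org)"

def http_get(url, timeout=30):
    """Fetch one URL with the identifying User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

def collect(urls, manifest_path, fetch=http_get, min_interval=2.0,
            sleep=time.sleep, clock=time.monotonic):
    """Fetch urls sequentially, at most one request per min_interval seconds, and write a manifest."""
    entries = []
    last = None
    for url in urls:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)  # rate limit: space requests out
        last = clock()
        try:
            body = fetch(url)
            entries.append({"url": url, "status": "ok", "bytes": len(body),
                            "sha256": hashlib.sha256(body).hexdigest()})
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            entries.append({"url": url, "status": "error", "error": str(exc)})
    manifest = {"user_agent": USER_AGENT, "min_interval_s": min_interval, "entries": entries}
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest
```

Passing `fetch` and `sleep` in as parameters keeps the rate-limit logic testable without touching the network.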

Clean and shape at the source

Split EDGAR submissions, sniff markup-as-text, and synthesize the multi-format mix from real content.
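As a rough sketch of source-level cleaning: EDGAR full-submission text files wrap each sub-document in `<DOCUMENT>` blocks with `<TYPE>` and `<TEXT>` tags, so they can be split before anything is ingested. The sniffing thresholds, the hint list, the regex-based tag stripping, and the placeholder namespace URI below are assumptions, not the production cleaner.

```python
import html
import re

DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>([^\r\n<]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
XBRL_HINTS = ("<ix:", "xmlns:", "us-gaap:")

def looks_like_markup(text, sample_chars=4000, tag_share=0.2):
    """Heuristic sniff: an XBRL hint, or tags filling a large share of the leading sample."""
    sample = text[:sample_chars]
    if not sample.strip():
        return False
    if any(hint in sample.lower() for hint in XBRL_HINTS):
        return True
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(sample))
    return tag_chars / len(sample) > tag_share

def split_submission(raw):
    """Split a full-submission text file into (type, body) pairs."""
    parts = []
    for match in DOC_RE.finditer(raw):
        block = match.group(1)
        doc_type = TYPE_RE.search(block)
        body = TEXT_RE.search(block)
        parts.append((doc_type.group(1).strip() if doc_type else "UNKNOWN",
                      body.group(1).strip() if body else ""))
    return parts

def strip_markup(text):
    """Crude tag removal: drop tags, decode entities, collapse whitespace."""
    return re.sub(r"\s+", " ", html.unescape(TAG_RE.sub(" ", text))).strip()

def narrative_text(raw):
    """Keep the narrative: skip XBRL data exhibits, strip markup from what remains."""
    kept = []
    for doc_type, body in split_submission(raw):
        if doc_type.upper().startswith("EX-101"):
            continue  # XBRL instance/schema/linkbase exhibits carry data, not prose
        kept.append((doc_type, strip_markup(body) if looks_like_markup(body) else body))
    return kept

# Tiny synthetic submission for illustration; the namespace URI is a placeholder.
SAMPLE = """<DOCUMENT>
<TYPE>10-K
<TEXT>
<html><body><p>Revenue <ix:nonFraction name="us-gaap:Revenues">100</ix:nonFraction> million.</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-101.INS
<TEXT>
<xbrl xmlns:us-gaap="http://example.org/us-gaap"><us-gaap:Revenues>100</us-gaap:Revenues></xbrl>
</TEXT>
</DOCUMENT>
"""

print(narrative_text(SAMPLE))  # [('10-K', 'Revenue 100 million.')]
```

Sniffing content instead of trusting the `.txt` extension is the point: the file extension says text, the bytes say markup.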

Sample for shape, not just size

Stratified cross-topic mixes with a bridge-aware mode that pulls multi-anchor documents on purpose, plus exact-N screening and refill.
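One way such a sampler could work is sketched below. It is not the production sampler: the weighting scheme, the `boost` parameter, the eviction rule, and the document dictionaries are hypothetical, and stratification by topic is left out. The weighted draw uses Efraimidis-Spirakis keys for sampling without replacement.

```python
import random

def bridge_weight(doc, boost=5.0):
    """Weight 1.0 for a single-anchor document; each extra anchor adds `boost` (assumed scheme)."""
    return 1.0 + boost * max(0, len(doc["anchors"]) - 1)

def covered(sample, required):
    """Required anchors that appear in at least one sampled document."""
    return {a for a in required if any(a in d["anchors"] for d in sample)}

def bridge_aware_sample(docs, n, required, seed, boost=5.0):
    """Draw exactly n documents, biased toward multi-anchor ones, then screen and refill."""
    if not 0 < n <= len(docs):
        raise ValueError("n must be between 1 and the pool size")
    rng = random.Random(seed)
    # Efraimidis-Spirakis weighted sampling without replacement: the n largest keys win.
    keyed = [(rng.random() ** (1.0 / bridge_weight(d, boost)), i, d)
             for i, d in enumerate(docs)]
    keyed.sort(key=lambda t: (-t[0], t[1]))
    chosen = [d for _, _, d in keyed[:n]]
    pool = [d for _, _, d in keyed[n:]]
    for anchor in required:
        if anchor in covered(chosen, required):
            continue
        candidate = next((d for d in pool if anchor in d["anchors"]), None)
        if candidate is None:
            raise ValueError(f"no document in the pool carries required anchor {anchor!r}")
        before = covered(chosen, required)
        for victim in reversed(chosen):  # evict the lowest-ranked document that is safe to drop
            rest = [d for d in chosen if d is not victim]
            if covered(rest, required) == before:
                chosen = rest + [candidate]
                pool.remove(candidate)
                break
        else:
            raise ValueError("cannot refill without losing another required anchor")
    return chosen

# Toy pool: many single-anchor documents, one MAX document, a few bridging documents.
corpus = (
    [{"id": f"fomc-{i}", "anchors": {"FOMC"}} for i in range(20)]
    + [{"id": "max-0", "anchors": {"737 MAX"}}]
    + [{"id": f"bridge-{i}", "anchors": {"NTSB", "FAA"}} for i in range(5)]
)
sample = bridge_aware_sample(corpus, n=5, required=["737 MAX", "NTSB"], seed=7)
print(sorted(d["id"] for d in sample))
```

The refill step swaps documents instead of appending, so the sample stays at exactly N. If a required anchor is missing from the whole pool the function raises, since no sampler can supply an anchor that was never collected.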

Pin the run

Seed + provenance sidecar on every sample, so an A/B compares the change — not two different draws.
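A minimal sketch of pinning a run, assuming a JSON sidecar with hypothetical field names. Sorting the pool first makes the draw depend only on the seed and the pool contents, and the recorded pool hash shows whether two runs actually started from the same pool.

```python
import hashlib
import json
import platform
import random

def draw_with_provenance(doc_ids, n, seed, sidecar_path):
    """Draw n document ids with a pinned seed and write a provenance sidecar next to the sample."""
    ordered = sorted(doc_ids)  # fix the input order so it cannot change the draw
    rng = random.Random(seed)
    sample = rng.sample(ordered, n)
    record = {
        "seed": seed,
        "n": n,
        "pool_size": len(ordered),
        "pool_sha256": hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest(),
        "python": platform.python_version(),  # sampling internals can differ across versions
        "sample": sample,
    }
    with open(sidecar_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)
    return sample
```

Calling it twice with the same seed and the same pool, in any input order, returns the same sample; a changed `pool_sha256` in the sidecar flags that an A/B is comparing two different pools.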

What We Haven't Solved — Yet

Two sources is a partial answer, and we'd rather say so.

The honest limit is that every corpus we test against is still drawn from the same two source families: one litigation docket and one set of five public topics.

That's enough to exercise the properties we care about — single-case connectivity from System A, cross-domain bridging from System B — and the bridge-aware sampler and seeded provenance make those runs reproducible and meaningful. But it is not the same as broad, independent coverage. Two source families share collection methods, era, and register; a pipeline can quietly overfit to their idiosyncrasies, and a metric that looks stable across them may simply reflect what they have in common.

We're explicit about this because the whole point of these notes is that a benchmark inherits the blind spots of its corpus. The path forward is more independent sources — additional dockets, other public-agency ecosystems, genuinely different document registers — added one at a time, each with its own provenance record, so we can tell when a result generalizes and when it was an artifact of where the data came from. Until then, read our numbers as what they are: documented, reproducible, and drawn from two wells.

Evidence Before Conclusion — About Our Own Data, Too

We hold the test corpus to the same standard we hold a result.

A document-AI system should tell you where to look, not what to conclude — and the same discipline applies to how we evaluate it. Knowing exactly where our test data came from, and where it falls short, is what keeps a benchmark from quietly misleading us.


Provenance

Condensed from the internal "Test Corpora — Provenance & Methodology" record. Sources are public: the SEC v. Ripple Labs docket via CourtListener / RECAP, and public-agency artifacts across five topic ecosystems (Boeing MAX, COVID-19, Federal Reserve, NTSB, Tesla FSD). File counts, the ~120× bridge-sampling figure, and the 30–600% count-swing at N=30 are reproduced from documented internal runs. No proprietary or private-individual data is involved; named court defendants are public figures in a litigated case. Companion reading: The Corpus-Shape Trap.

— V.I. lab notes, 2026-06-27
