Research Blog · Lab Notes · 2026-06-07

The Corpus-Shape Trap

Document-search systems answer from what your library contains — not from what the answer should be. When the right answer isn't in the corpus, retrieval still returns something: the closest thing it can find. That isn't a bug. It's the defining property of retrieval-augmented search, and the most common reason an analyst loses trust in a system that's working exactly as designed.

BDS tells you where to look — not what to conclude.

The Trap, In One Example

We asked about the 737 MAX. We got a school bus.

A real query, against a real corpus we run internally:

“What was MCAS and how did it contribute to the 737 MAX crashes?”

The corpus was a 30-document mix weighted toward NTSB highway incident reports. Vector search ran. Keyword search ran. The top five results were chunks from a school bus crash report.

Why? “Crash” is a high-frequency word in NTSB highway material. “MCAS” appeared nowhere in the corpus. Retrieval did exactly what it was built to do: find the chunks closest to the query, plus chunks where the keywords match. School-bus crash chunks were the closest available match — not because they were correct, but because nothing closer existed.

The RAG model — the part that synthesizes an answer from the retrieved chunks — read those chunks and said, correctly: “The provided context is about a school bus crash, not the Boeing 737 MAX or MCAS.”

Why This Happens — And Why It Isn't Really Fixable

Retrieval has two phases. Only one of them can say “nothing.”

Match

Find the K chunks most similar to the query, where similarity is a weighted combination of a semantic score derived from vector distance and a keyword score from BM25. This phase cannot return “nothing.” It returns the top K, period, whether or not your question has a home in the library.

Read

Hand those K chunks to the language model and ask it to synthesize an answer. This phase can — and on a well-prompted system, will — say the chunks don't answer the question. Good. But the chunks still appear above the answer, looking confidently relevant.
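To make the Match phase concrete, here is a minimal sketch of a hybrid top-K ranker. It is not the BDS implementation: bag-of-words cosine stands in for a real embedding model, a simple term-overlap ratio stands in for BM25, and the names `match`, `alpha`, and the sample chunks are all hypothetical. The point it illustrates is structural: the function ranks and slices, so it hands back K chunks even when every score is zero.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; no stemming, so "crashes" != "crash".
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-count vectors (stand-in for embeddings).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_overlap(query_tokens, chunk_tokens):
    # Fraction of distinct query terms present in the chunk (stand-in for BM25).
    q = set(query_tokens)
    return len(q & set(chunk_tokens)) / len(q) if q else 0.0


def match(query, chunks, k=5, alpha=0.5):
    # Score every chunk, sort, and return the best k. There is no threshold,
    # so the result is never empty while the corpus has chunks.
    q_tokens = tokenize(query)
    q_vec = Counter(q_tokens)
    scored = []
    for chunk in chunks:
        c_tokens = tokenize(chunk)
        score = alpha * cosine(q_vec, Counter(c_tokens)) + (1 - alpha) * keyword_overlap(q_tokens, c_tokens)
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical three-chunk corpus with no mention of MCAS.
chunks = [
    "The school bus crash occurred on a rural highway.",
    "The driver reported brake fade before the crash.",
    "Investigators reviewed the highway guardrail design.",
]
query = "What was MCAS and how did it contribute to the 737 MAX crashes?"

for score, chunk in match(query, chunks, k=2):
    print(f"{score:.3f}  {chunk}")
```

In this toy corpus the only shared token is a stopword, yet two chunks still come back ranked. Adding a score cutoff would change that, but choosing the cutoff is its own judgment call.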

A casual user clicks the first result, reads a paragraph that uses familiar words, and walks away with the wrong story. The fix isn't to make retrieval refuse to return anything: that breaks more than it fixes, and the math doesn't support it cleanly anyway. The fix is to make the situation legible, so the user can see “this corpus doesn't have what you're asking about” without having to deduce it from second-order signals.

How To Recognize It

Four signals. None decisive alone. All four at once means trap.

BDS surfaces the retrieval's own weakness instead of hiding it. When these fire together, the system is telling you the corpus is empty-handed.

Signal	What it looks like	What it means
all bm25	The Channel column shows no “vector” or “both” hits, only keyword matches.	The query had no semantic neighbor in the corpus. Only word overlap survived.
CORPUS_GAP	An orange-banded Result Critic warning at the top of results, labeled CORPUS_GAP (or WEAK).	A heuristic engine looked at the retrieval shape and flagged it as not actually relevant.
no citations	The RAG ✓ column is all · — the model saw the chunks but cited none.	The synthesis model deliberately chose not to use any retrieved chunk.
honest answer	The synthesized answer says “the context does not contain…” or drifts to an adjacent topic.	The model is doing its job — telling you the truth about what it was given.

Together, these are the system's way of saying: “We looked. There isn't anything here. The closest match we found has keyword overlap but no semantic connection. Don't read these as relevant.” Listen to the signals.
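The four-signal check can be sketched as a small function. This is an illustration only, not how BDS computes anything: the `Hit` record, its field names, and the `ABSTAIN_PHRASES` list are hypothetical, and phrase matching is a crude stand-in for judging whether an answer is an honest refusal. The channel values and the CORPUS_GAP and WEAK labels are taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    channel: str  # "vector", "bm25", or "both" (Channel column)
    cited: bool   # True if the synthesized answer cited this chunk


# Hypothetical phrases that suggest the model declined to answer.
ABSTAIN_PHRASES = ("does not contain", "not about", "no information")


def trap_signals(hits, critic_flag, answer):
    # One boolean per signal; none is decisive alone.
    text = answer.lower()
    return {
        "all_bm25": bool(hits) and all(h.channel == "bm25" for h in hits),
        "critic_warning": critic_flag in ("CORPUS_GAP", "WEAK"),
        "no_citations": bool(hits) and not any(h.cited for h in hits),
        "honest_answer": any(p in text for p in ABSTAIN_PHRASES),
    }


def looks_like_trap(signals):
    # All four firing together is the pattern described above.
    return all(signals.values())


hits = [Hit("bm25", False), Hit("bm25", False)]
signals = trap_signals(hits, "CORPUS_GAP", "The context does not contain information about MCAS.")
print(signals, looks_like_trap(signals))
```

Returning the individual signals, rather than a single verdict, keeps the final call with the analyst.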

Each term above — Channel, CORPUS_GAP, the RAG citation column — is defined in the Nuance Glossary.

What To Do When You See Them

In order of effort.

The corpus-shape trap is recoverable the moment you stop treating “results returned” as “question answered.” Each step below costs a little more than the last.

Re-phrase in the corpus's vocabulary

Search with a term the documents actually use — “MCAS angle of attack,” not “anti-stall system.”

Check the Project State panel

It surfaces top entities, anchor presence, and per-anchor mention counts. If your subject isn't there, your search won't find it either.

Load a corpus that contains the answer

Sometimes the lesson is “wrong corpus loaded,” not “wrong question asked.”

Accept the absence as a finding

“This isn't covered in this library” tells you something real about what the library is — and where to look next. Not a failure; the system honoring what's there.
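If you have the raw documents at hand, the first two steps can be approximated outside the product with a quick mention count. This is a rough sketch under assumptions: `mention_counts` is a hypothetical helper using whole-word, case-insensitive matching, and it does not reproduce how the Project State panel derives its entity or anchor counts.

```python
import re
from collections import Counter


def mention_counts(terms, documents):
    # Count whole-word, case-insensitive occurrences of each term across documents.
    counts = Counter()
    for doc in documents:
        lowered = doc.lower()
        for term in terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            counts[term] += len(re.findall(pattern, lowered))
    return {term: counts[term] for term in terms}


# Hypothetical two-document corpus.
documents = [
    "The school bus crash occurred on a rural highway.",
    "Crash investigators arrived within the hour.",
]
print(mention_counts(["MCAS", "crash"], documents))
```

A zero next to your subject term is a strong hint to re-phrase or load a different corpus before reading any results.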

Why The Trap Matters For Evaluation

A confidently wrong answer is worse than no answer.

If you're evaluating a document-AI system for a private deployment, this is one of the things to test for directly.

A system that silently returns confidently wrong results when asked about something it doesn't know fails worse than a system that returns no results. The first builds false belief; the second builds correct calibration.

The trust signals above — Channel column, Result Critic banner, RAG citation column, honest synthesized answer — exist because we got burned by exactly this. Early versions of BDS would return keyword-only matches and the answer panel would dutifully synthesize a confident-sounding response. Operators trusted it. Operators got the wrong story. We added each signal as a direct response to that failure mode.

The lesson generalizes. When evaluating any RAG-style system, don't ask “how well does it answer questions in its corpus?” Ask “how honestly does it tell you when a question is outside its corpus?” The first question gets you a model that's been benchmarked. The second gets you a system that won't quietly mislead you in production.
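One way to run that second question as a test is to feed the system questions you know are outside its corpus and measure how often it says so. The harness below is a minimal sketch under assumptions: `ask` is any callable you supply that takes a question and returns the system's answer text, `ABSTAIN_MARKERS` is a hypothetical phrase list, and substring matching is a crude proxy for a human judging whether the answer was honest.

```python
from typing import Callable, Iterable

# Hypothetical phrases that indicate the system admitted it had no answer.
ABSTAIN_MARKERS = ("does not contain", "not covered", "no relevant")


def abstained(answer: str) -> bool:
    # True if the answer text contains any abstention marker.
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(ask: Callable[[str], str], out_of_corpus_questions: Iterable[str]) -> float:
    # Fraction of known out-of-corpus questions the system declined to answer.
    questions = list(out_of_corpus_questions)
    if not questions:
        raise ValueError("need at least one out-of-corpus question")
    return sum(abstained(ask(q)) for q in questions) / len(questions)


def fake_system(question: str) -> str:
    # Stand-in for a real system: honest about MCAS, confidently wrong otherwise.
    if "MCAS" in question:
        return "The provided context does not contain information about MCAS."
    return "The incident was caused by brake failure."


rate = abstention_rate(fake_system, ["What was MCAS?", "Who designed the Concorde?"])
print(f"abstention rate on out-of-corpus questions: {rate:.2f}")
```

A low rate on questions you know are uncovered is the failure this section describes. Spot-check the answers by hand, since a phrase list will miss honest refusals worded differently.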

Where To Look. Not What To Conclude.

The system surfaces the signals. The analyst reads them.

BDS can tell you the chunks it returned are weak matches. It cannot tell you whether weak matches mean “this question is not in the corpus” or “this corpus has nothing for anyone.” That's an analyst judgment. Our job is to surface the signals; yours is to read them.


Provenance

Source example: the P001-30 internal test corpus referenced in the BDS Behaviors & Nuances reference, §3 (“The corpus-shape trap”). Supporting material: the BDS User Guide “Statistical considerations,” “Why a result might be missing,” and “Result Critic banner” sections. The worked query and the school-bus retrieval are reproduced from a real run; no proprietary or private-individual data is involved.

— V.I. lab notes, 2026-06-07
