Research Blog · Lab Notes · 2026-06-29

Facts in Code, Judgment in the Model

We put a language model on both ends of the OSINT workflow in BDS — a triage that recommends which leads to chase, and a brief that reads the findings back to you. Two features, one underlying lesson, learned twice: where you let a model use judgment, and where you absolutely must not.

If a fact can be computed, compute it. Save the model's judgment for the part that actually needs judgment.

Two Features, One Lesson

The probing was never the hard part.

The OSINT subsystem in BDS takes the people, companies, handles, phones, and domains pulled out of a private corpus and probes them against public sources. Running the probes is the easy part. The hard part is everything around a result — deciding which candidates are worth probing at all, and reading a tree of raw findings without drowning in noise.

So we added a language model at both ends: a pre-probe triage that recommends which candidates to chase, and a post-probe brief that reads the findings the way an analyst would. Building them surfaced the same lesson twice, from opposite directions.

Lesson One — The Model Trusted The Wrong Field

It recommended probing a law firm's domain. Confidently.

The first version of the triage was confident and wrong. Given a docket of public securities litigation, it enthusiastically recommended probing an outside law firm's web domain and a couple of institutional phone numbers — exactly the leads that produce a hundred hostnames and zero insight.

The cause was mundane and instructive. Each candidate carries a linked_to field: the nearest entity it co-occurred with during extraction. It's a useful breadcrumb, but it is proximity, not meaning. The law firm's domain sat near a person-of-interest anchor in the text, so the model read "linked to the subject" as "relevant to the subject" and promoted it. It was doing exactly what a reasonable reader does with a field labelled "linked to." The field was just noisier than its name implied.

The fix was not a better prompt. The fix was to stop asking the model to infer facts it shouldn't have to infer. Before the candidate list ever reaches the model, we now compute a small set of deterministic flags in plain code and hand them over as ground truth:

| Flag | What the code checks (no inference) |
| --- | --- |
| third_party_domain | The domain belongs to an outside party (a law firm, a vendor), not a subject of the matter. |
| subject_domain | The domain actually stems from a subject organization in the corpus. |
| freemail | A generic mail provider, not an organizational identity. |
| phone_metadata_only | A phone that recon can only enrich with carrier/region metadata, nothing identifying. |
| infra_recon_target | A probe that yields infrastructure, not identity. |

These aren't judgments. They're checks: string-stem comparisons, a freemail set, a tool-class lookup. The model still does the reasoning ("is identity enrichment useful for this matter?"), but it reasons over facts it can trust instead of guessing at them. With the flags in place, the law-firm domain stopped getting recommended, because the model could now see it was a third party rather than having to deduce it from a proximity hint.
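To make the idea concrete, here is a minimal sketch of what such a flag pass could look like. It is not the BDS code. The five flag names come from the table above; everything else is an assumption for illustration: the Candidate shape, the tool-class names, the abbreviated freemail set, and the rule that any non-freemail domain whose stem matches no subject organization counts as third party.

```python
import re
from dataclasses import dataclass

# Hypothetical, abbreviated lookup sets; a real build would carry fuller ones.
FREEMAIL_DOMAINS = frozenset({"gmail.com", "outlook.com", "yahoo.com", "proton.me"})
INFRA_TOOL_CLASSES = frozenset({"subdomain_enum", "dns_recon"})
METADATA_ONLY_TOOL_CLASSES = frozenset({"phone_carrier_lookup"})
LEGAL_SUFFIXES = frozenset({"inc", "llc", "ltd", "corp", "corporation", "co", "plc"})


@dataclass(frozen=True)
class Candidate:
    kind: str            # "domain", "email", "phone", ...
    value: str
    tool_class: str      # hypothetical: the class of probe that would run
    linked_to: str = ""  # proximity hint only; deliberately unused below


def org_stem(name: str) -> str:
    """'Acme Corp' -> 'acme': lowercase, drop legal suffixes and punctuation."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    kept = [w for w in words if w not in LEGAL_SUFFIXES]
    return "".join(kept or words)


def domain_stem(domain: str) -> str:
    """'www.acme.com' -> 'acme'. Naive: ignores multi-part TLDs like .co.uk."""
    labels = [part for part in domain.lower().split(".") if part]
    if not labels:
        return ""
    label = labels[-2] if len(labels) >= 2 else labels[0]
    return re.sub(r"[^a-z0-9]", "", label)


def domain_of(candidate: Candidate):
    """Return the domain for domain/email candidates, else None."""
    if candidate.kind == "email":
        return candidate.value.rsplit("@", 1)[-1].lower()
    if candidate.kind == "domain":
        return candidate.value.lower()
    return None


def compute_flags(candidate: Candidate, subject_orgs: list) -> dict:
    """Deterministic checks only; nothing here reads linked_to."""
    domain = domain_of(candidate)
    subject_stems = {org_stem(org) for org in subject_orgs}
    freemail = domain in FREEMAIL_DOMAINS
    has_org_domain = domain is not None and not freemail
    subject = has_org_domain and domain_stem(domain) in subject_stems
    return {
        "third_party_domain": has_org_domain and not subject,
        "subject_domain": subject,
        "freemail": freemail,
        "phone_metadata_only": candidate.kind == "phone"
        and candidate.tool_class in METADATA_ONLY_TOOL_CLASSES,
        "infra_recon_target": candidate.tool_class in INFRA_TOOL_CLASSES,
    }


law_firm = Candidate("domain", "www.smithjoneslaw.com", "subdomain_enum",
                     linked_to="person-of-interest anchor")
flags = compute_flags(law_firm, subject_orgs=["Acme Corp"])
```

In this sketch the law-firm domain comes back flagged as both third_party_domain and infra_recon_target, whatever its linked_to value says. The flags would then travel with the candidate into the prompt as stated facts, so the model never has to guess them from proximity.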

The rule we walked away with: if a fact can be computed, compute it. Don't spend the model's judgment budget re-deriving something a five-line function knows for certain.

Lesson Two — The Most Useful Answer Was "Don't Bother"

None of this is worth doing here.

We were admiring the recommendation engine on a corpus of well-known public figures — a closed, public securities matter — when a colleague said the quiet part out loud: none of this is worth doing. The subjects were already public, the record was retrospective, and every recommendation, however well-triaged, was a distraction.

That reframed the whole feature. OSINT isn't a step in a linear pipeline; it's a conditional branch. Its value depends entirely on what the matter needs. Vetting unknowns, attributing an identifier, running diligence on people the documents haven't characterized? External research is gold. Reading a finished record about figures the world already knows? It's noise with extra steps.

So the triage now leads with a verdict, and the candidate list is subordinate to it. The discriminator is one question: does the matter need external information about subjects the corpus hasn't already pinned down?

| Verdict | What it means | Reviewer posture |
| --- | --- | --- |
| worthwhile | The matter likely needs external information about subjects the documents haven't fully characterized. | Probe the recommended candidates first. |
| marginal | Some external upside, but narrow. A few identities may be worth a look; most candidates won't move the matter. | Probe selectively, if at all. |
| not worthwhile | The subjects are already public and well-documented, or the matter is a retrospective record. External probing mostly adds noise. | Skip OSINT and stay in the corpus. |

On a public-figure docket, the model uses its own knowledge of the named subjects, recognizes them as already-documented, and returns not worthwhile with high confidence — even when no one told it what the case was about.

Two design choices keep that honest. First, the verdict always states the assumption it made about your objective, because intent is rarely knowable from the corpus alone — and it names the one condition that would flip the call, so you can correct it in a single read instead of arguing with it. Second, it still lists a few recommendations even on a no-go, labelled "if you proceed anyway" and left unselected. The point isn't to refuse; it's to be straight with you about the upside before you spend the time.
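One way to hold a verdict to those two choices is to encode them in the result type, so a verdict without a stated assumption and flip condition cannot be built at all. The sketch below is illustrative, not the BDS schema. The three verdict values and the "if you proceed anyway" label come from the text; the class names, field names, the other two headings, and the pre-selection rule for worthwhile and marginal are assumptions.

```python
from dataclasses import dataclass, field, replace
from enum import Enum


class Verdict(Enum):
    WORTHWHILE = "worthwhile"
    MARGINAL = "marginal"
    NOT_WORTHWHILE = "not worthwhile"


@dataclass(frozen=True)
class Recommendation:
    candidate: str
    reason: str
    selected: bool = False


@dataclass
class TriageResult:
    verdict: Verdict
    assumed_objective: str  # what the model assumed the reviewer is after
    flip_condition: str     # the one condition that would change the call
    recommendations: list = field(default_factory=list)

    def __post_init__(self):
        # Reject any verdict that does not show its work.
        if not self.assumed_objective.strip() or not self.flip_condition.strip():
            raise ValueError("verdict must state its assumption and flip condition")


def apply_posture(result: TriageResult):
    """Return (heading, recommendations) with selection set by the verdict."""
    if result.verdict is Verdict.NOT_WORTHWHILE:
        heading, preselect = "If you proceed anyway", False
    elif result.verdict is Verdict.MARGINAL:
        heading, preselect = "Probe selectively, if at all", False  # assumed
    else:
        heading, preselect = "Recommended probes", True  # assumed
    recs = [replace(rec, selected=preselect) for rec in result.recommendations]
    return heading, recs


result = TriageResult(
    verdict=Verdict.NOT_WORTHWHILE,
    assumed_objective="Reviewing a closed, public securities matter.",
    flip_condition="You need to vet someone the documents do not characterize.",
    recommendations=[Recommendation("example-handle", "unattributed identifier",
                                    selected=True)],
)
heading, recs = apply_posture(result)
```

Here a no-go verdict still carries its recommendations, but they arrive under the "If you proceed anyway" heading with nothing pre-selected, and the reviewer can read the assumption and the flip condition straight off the result.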

The Division Of Labor

Facts moved into code. Judgment stayed in the model — and had to show its work.

Both fixes are the same shape. The deterministic flags moved facts out of the model and into code. The verdict kept judgment in the model but made it account for its assumptions. What's left for the language model is the genuinely model-shaped work — weighing whether external enrichment fits a matter, reading a messy evidence tree the way an analyst would — and nothing it can get wrong by trusting a mislabelled field.

That's the line we keep coming back to across the whole BDS pipeline, not just OSINT: the system points where to look; the reviewer decides what it means. A language model is good at pointing. It is dangerous exactly when you let it quietly decide a fact it should have been handed.

Where To Look. Not What To Conclude.

The synthesis points. It never concludes.

Every recommendation and brief is grounded in evidence you can open, and the reviewer keeps the final call. The model's job is to spend your attention well — including telling you when the honest answer is to spend it elsewhere.

Provenance

Written from the actual build. The OSINT synthesis layer described here — the candidate triage, the deterministic flags, and the worthiness verdict — runs on a local 14B model inside the BDS OSINT subsystem. The law-firm-domain example and the public-securities docket are drawn from a real run against public-record litigation; no proprietary or private-individual data is involved.

— V.I. lab notes, 2026-06-29