ADR-00520 May 2026 · 3 min read

Hallucination and data leakage: how we treat them as engineering problems, not vibes

Two failure modes — making things up, and leaking what shouldn't leave — show up in every production LLM system. We treat both as defects with measurable rates, not character flaws of "the AI".

llmsecuritygdprdiscipline

Context

Every conversation with a buyer who's wary of AI lands on the same two fears: will it make things up? and will it leak our data? Both fears are warranted. Both are also engineering problems with engineering answers, not philosophical ones.

A study we've internalised: in unconstrained domains, modern LLMs hallucinate cited facts somewhere between 3% and 27% of the time depending on the model and task — and the rate is higher precisely when the model sounds most confident. A separate body of work on data leakage shows that prompt-and-response pairs containing PII can resurface in unrelated sessions if the upstream provider trains on user data. Neither risk is hypothetical.

The default posture in the LLM-app world is to treat hallucination as inevitable and data leakage as the provider's problem. That position is defensible. It is not, however, the position we operate from.

Decision

We treat both failure modes the same way we'd treat any production defect: with measurable rates, named mitigations, and runbooks for when the mitigations fail.

For hallucination:

Retrieval before generation. No fact-shaped answer leaves the system without a citation pulled from a vetted store. The LLM rewrites; it doesn't author. RAG isn't a buzzword — it's the boundary between "the model said it" and "the system stands behind it".
Constrained outputs where the answer space is finite. JSON schemas with enums for classification, regex-checked tokens for identifiers, post-generation validators for numeric ranges. If the model can return only one of seven values, it cannot hallucinate an eighth.
Confidence-aware UX. When the system isn't sure, the UI says so. Users don't see a fabricated answer wrapped in a confident sentence — they see "I'm not certain, here are three sources that might help" and the option to escalate.
Eval suites that ship with the code. Every release runs against a corpus of known-answer questions and known-trap questions. A regression in either is a release blocker.

For data leakage:

Pseudonymisation at the boundary. Already documented in ADR-001. PII never reaches the provider in cleartext. The mapping table never leaves our trust boundary.
Opt-out of training, in writing. Every provider contract specifies that prompts and completions are not used to train upstream models. For Anthropic and OpenAI this is the default on their API tiers; for newer providers, we read the DPA before signing.
Audit logs for every external call. Append-only, queryable, retained for the regulatory window. If a buyer ever asks "what data left our system on the 3rd of March between 14:00 and 15:00", we can answer that question.
Server-side filtering for high-risk classes. Card numbers, NHS numbers, NI numbers, passport numbers — pattern-matched and stripped before any prompt is built. False positives are preferable to leaks.

Consequences

Operational cost. Pseudonymisation adds ~80ms p95 (ADR-001). Retrieval adds ~120ms on a warm cache. Eval suites add ~3 minutes to CI. Total budget: under 500ms per request, under 5 minutes per release. We pay it.

Honesty cost. The constrained-output approach means the system sometimes refuses to answer a question it could plausibly answer if we let it freelance. We've decided "I don't know" is a better failure mode than "I'm confidently wrong". Some users prefer the wrong answer because it feels helpful. We don't optimise for them.

Posture. When a regulated buyer asks "how do you handle hallucination", the answer is short. Retrieval-first generation. Constrained outputs. Confidence-aware UX. Eval suites. Each one is a one-line answer with a written ADR behind it.

Alternatives considered

Trust the model. Defensible at very low stakes (e.g. an internal brainstorming tool used by engineers who can sanity-check output themselves). Indefensible at any stake higher than that.
Human-in-the-loop on every answer. Works. Doesn't scale. Becomes a moat for whoever's prepared to staff the loop.
Switch to a smaller, more constrained model. Sometimes the right answer — Haiku-class models hallucinate less on simple tasks. Usually a complementary choice rather than a replacement.

When this is the wrong answer

When the cost of being wrong is genuinely zero — a creative-writing assistant, a brainstorming partner, a draft-generator that a human will edit before anything ships. Those use cases tolerate (and arguably benefit from) the model freelancing. Most of what we build does not.