# Who Checks the Checker? Adversarial Fact-Fidelity for AI Text

> Why naive AI fact-checking fails and how an adversarial review actually works — with a copyable verification prompt, the ground-truth hierarchy, and a real case where even the checker got it wrong.

Source: https://www.jpkc.com/db/en/blog/faktentreue-review/

I wanted to write a blog article and had someone else's guide as a template — technical, detailed, confidently phrased. Exactly the kind of text you believe. Instead of believing it, I set a verification agent on it whose only job was to **refute** the claims. The result: five solid falsehoods — invented commands, a nonexistent environment variable, a script that would never have worked.

And then the twist that turns a nice anecdote into a method: **the checker itself was wrong twice.** It claimed two things didn't exist — which very much do. I only caught that because I held its verdict against a source I trusted more than it.

That's the story of this article, and even more so the method behind it: how do you check AI output for fact-fidelity when the checker can share the very weakness of the thing being checked? Everything here is tool-neutral — the verification pattern works with any chat window.

## Plausible is not true

Large language models produce text that *sounds* like knowledge. That's their purpose and their danger at once: a false claim arrives with the same calm authority as a true one. There's no hesitation, no "I'm not sure" precisely where it matters.

It gets especially treacherous with **fast-moving facts** — software features, API behavior, version states, prices. The model has a training cutoff; the world has moved on. And exactly where you'd least expect it, it errs most confidently: about its own tools. My template contained a command `/effort max` that simply doesn't exist, and a configuration via an environment variable `CLAUDE_CONFIG` that was pure invention. Both sounded entirely coherent. Both I would have published unchecked — and my readers would have copied them.

> **Key point:** confidence of tone is no signal of correctness. With AI text it's no signal at all — the phrasing is equally assured whether the statement holds or not.

## Why the naive fact-check fails

The intuitive reflex is to ask the model itself: "Is this right?" or "Is the article correct?" That's the worst possible verification brief, for a single reason: **sycophancy**. Models are trained to agree and to please. Ask "is this right?" and you're implicitly seeking confirmation — and you get it. The checker and the author then share not only the same training but the same direction: confirm.

The way out is a shift in the burden of proof. Not "confirm that this is true" but **"try to prove that this is false."** A checker actively hunting for refutation finds errors a confirming checker reads straight past. This isn't an AI quirk but ancient craft: a good editor reads against the text, not with it. A security team calls itself a *red team*, not an *approval team*.

## The adversarial verification pattern

This shift in perspective can be poured into a prompt you can adopt directly. The core is forcing three things on the checker: the refutation stance, the doubt rule, and the sourcing obligation.

```text
Check the following claims one by one. Your job is NOT to confirm them
but to refute them.

For each claim:
- Actively try to falsify it.
- When in doubt, the verdict is FALSE — not "probably right".
- Rely solely on the primary source (official docs, spec, source code),
  not on your memory. Name the source.
- Give the correct version including exact syntax.

Answer per claim in this format:
VERDICT: TRUE | FALSE | UNVERIFIABLE
EVIDENCE: <primary source + verbatim passage>
CORRECT:  <correct statement / syntax>
```

Every line carries weight:

- **"refute, not confirm"** reverses the burden of proof and defuses the sycophancy.
- **"when in doubt, FALSE"** prevents the comfortable "probably right" that lets errors slip through. An UNVERIFIABLE is a valuable result — it marks exactly the spots you need to look up yourself.
- **"primary source, not memory"** is the most important line. It forces the checker to actually look, instead of drawing from the same training fog the errors came from. A checker that only queries its memory isn't verification — it's a second opinion with the same blind spots.
- **the structured format** forces a clear verdict per claim and makes the result reviewable at a glance instead of burying it in prose.

## The checker must not be the author

A detail that's easy to miss: *who* checks matters as much as *how*. Let the same context check the text it produced, and the result is contaminated — the conversation has already settled into "this text is good," and every earlier agreement colors the next.

Verification therefore belongs in a **fresh, independent context**. In practice that means: a second chat window, a new session, ideally a different model. Anyone working with an agentic tool has it even easier — a subagent runs in its own context by definition and can be pointed deliberately at the author as a checker; more on that in my piece on [configuring Claude Code](https://www.jpkc.com/db/en/blog/claude-code-konfiguration/). The principle stays tool-independent, though: **separate the instance that claims from the instance that checks.**

## The ground-truth hierarchy

This brings us to the heart of it — and to the point where my own review nearly failed. "Check against the source" is only half the rule, because not all sources are worth the same. There's a ranking:

1. **Primary source** — the official documentation, the specification, the source code itself. What it says, goes.
2. **Your own, already-verified work** — something you previously checked against a primary source and found correct.
3. **Model memory** — the weakest tier. Useful as a lead, worthless as proof.

This hierarchy isn't an academic detail. In my review the verification agent reported that a certain configuration option "does not exist per the documentation." Its verdict sounded as assured as everything else. I didn't adopt it anyway — because I had already documented the same fact in an *earlier, self-researched* article. Tier 2 beat tier 3. I overruled the checker, and it was wrong.

> **In practice:** when two sources conflict, the winner isn't the more confident one but the higher-ranked one. An AI verdict from memory loses to any sourced primary reference — even if the verdict is more persuasively phrased.

## Who checks the checker?

This is where the real lesson sits. My checker was wrong not once but twice — both times because it answered from memory instead of from a primary source. It did exactly what the method is supposed to guard against. The checker shared the weakness of the checked.

Two things follow. First: **a verification step is not proof until it's grounded.** Verification without source access is merely repetition in a different voice. Second: against *random* errors, redundancy helps — several checkers, a majority vote, different models. Against *systematic* errors it doesn't: if all checkers share the same outdated training, they agreeably endorse the same false fact. Three models that all believe `CLAUDE_CONFIG` is real don't produce truth, only a unanimous misunderstanding.

The only reliable anchor remains the primary source. More checkers raise the hit rate; they don't replace looking at the docs. At the end of the chain there must always be a human or a real source — not just one more opinion.

## When the effort is worth it — and when it isn't

Adversarial review costs time and, with AI checkers, tokens. It's not worth it for every sentence. It's worth it where **a verifiable truth exists and an error is expensive**:

- numbers, dates, version states, prices
- API and tool behavior, commands, exact syntax
- quotes, names, source attributions
- anything readers copy and run

And it's simply useless where there is no ground truth: matters of taste, opinions, style, predictions about the future. Setting an adversarial checker on "is this phrasing elegant?" only produces confident nonsense in both directions. And expect **false positives**: a checker trained on "when in doubt, FALSE" will also doubt correct statements. That's not a bug but the price of the method — every FALSE is a verification task for you, not a verdict.

## None of this is new

Look closely and you'll recognize in all of this not an AI procedure but old craft in new clothing. The proofreading editor, the newsroom fact-check, the four-eyes principle, red-teaming in security — the idea that a claim only earns standing through an independent, skeptical, source-backed check is centuries old. AI only changes the speed and the scale: checking becomes cheap enough to do it *always*, and necessary enough never to forget.

## Conclusion

Fact-fidelity isn't a nice-to-have but the currency in which credibility is measured — with human readers and, more than ever, with the AI systems that cite and redistribute your content. Publish something false and you're not just read wrong once but amplified wrong. The effort of an adversarial review is the insurance against that.

The method is simple: reverse the burden of proof, check in an independent context, and ground every verdict in a primary source. And keep the one sentence that saved me from an error in this very session: **trust no checker — not even an AI checker — that you haven't grounded against a real source.** How I carry the same separation of concerns down to the tool level is in the pieces on [configuring Claude Code](https://www.jpkc.com/db/en/blog/claude-code-konfiguration/) and [Claude Code in a Container](https://www.jpkc.com/db/en/blog/claude-code-container/).

