Accuracy Scorecard
We grade our own work in public. After publication, a separate panel re-checks every edition against its primary sources — and the result is published here.
Since launch, a separate post-publication verification pass has re-opened the primary sources behind 17 published stories across 2 editions, issuing 11 material corrections on 9 stories where a number or claim did not match the source. Last verified 11 Jun 2026.
We grade ourselves harder than most: a story counts as failed if any single one of its load-bearing claims is off — even if every other claim checked out. By that strict, story-level standard, 47.1% of re-checked stories were completely clean (8 of 17). Most outlets never run this audit at all.
| Month | Editions | Stories re-checked | Corrections | Accuracy |
|---|---|---|---|---|
| Jun 2026 | 2 | 17 | 11 | 47.1% |
Where the errors come from
Every correction we have issued, mapped by desk (row) and by the failure mechanism behind it (column). A single accuracy figure tells you that we err; this tells you where and why — so we can fix the pattern, not just the instance. Darker means more corrections of that kind on that desk. Click any square to read the exact corrections behind it. The last two columns put each desk in proportion: how many stories it has published, and the share with no material correction.
| Desk | Fabricated | Wrong # | Misframed | Wrong entity | Indication | Category | Temporal | Overstated | Caveat | Source | Total | Stories | Clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clinical Trials | 1 | 1 | 1 | 3 | 18 | 83.3% | |||||||
| FDA & Regulatory | 1 | 1 | 2 | 19 | 89.5% | ||||||||
| Research | 1 | 1 | 24 | 95.8% | |||||||||
| Public Health | 1 | 1 | 2 | 16 | 87.5% | ||||||||
| Devices & Diagnostics | 1 | 1 | 11 | 90.9% | |||||||||
| Digital Health & AI | 1 | 1 | 8 | 87.5% | |||||||||
| Opinion | 1 | 1 | 8 | 87.5% | |||||||||
| All desks | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 11 | 116 | 90.5% |
fewer → more corrections. Each square is one desk × one failure mode — click it to read those corrections. Built live from the corrections log.
What we fixed
A scorecard that only confessed would be theatre. Each recurring failure the matrix names, we answer with a guardrail written into the desk's own instructions — then we watch the number. This is the loop the paper runs on: error measured → cause named → fix shipped → trend watched.
-
Background facts must bind to their own source
What kept breaking The recurring root cause: the error is usually not in the story's own headline trial but in a second trial or a historical fact invoked for context and written from memory.
Guardrail shipped Every claim about a different trial, drug, or event than the story's own subject is now marked 'contextual' and must resolve its own registry, DOI, or agency id before publishing — or be cut. Background numbers are no longer written from model memory.
watching Contextual-claim corrections — a wrong fact about a second trial, or a historical 'since' claim — should trend to zero across every desk.
The corrections that triggered it: SEQUOIA-HCM called a 28-week trial · SKYSCRAPER-06 called a gastric trial · Screwworm 'first since 1966'
-
Trial cards: resolve every NCT you name
What kept breaking Three Clinical Trials corrections in one cycle — a comparator trial's indication, a dropped co-drug and sponsor, and an over-broad 'all secondary endpoints' claim.
Guardrail shipped Before publishing, build a trial card for every NCT named — the headline trial and any comparator or background trial — copying indication, phase, enrollment, sponsor, and all intervention names verbatim from the registry. Prose may not state an indication, phase, duration, or sponsor that disagrees with the card; a drug combination names every active intervention and the lead sponsor; secondary-endpoint claims enumerate which of N met significance instead of saying 'all'.
watching Indication, entity, and endpoint-framing corrections on the Clinical Trials desk should trend to zero.
The corrections that triggered it: SKYSCRAPER-06 indication · ISLEND named only Merck's islatravir · MAPLE-HCM 'all secondary endpoints'
-
Recall class from the record; intervals from two dates
What kept breaking A recall printed at the wrong severity class, and a dek whose 'eight months' contradicted the article's own dated anchor.
Guardrail shipped Print a recall's class straight from the openFDA classification field — never inferred from how serious it sounds — and keep the recall's class distinct from the device's regulatory class. Compute every 'X months/weeks after Y' interval from two explicit dated anchors and state both dates.
watching Category and temporal corrections on the FDA & Regulatory desk should trend to zero.
The corrections that triggered it: BD Pyxis recall labelled Class I · MFLUSIVA 'eight months after' RTF
-
One number, one meaning — and a source for every 'first/since'
What kept breaking A headline state-count that blended confirmed cases with distribution, and a 'first since 1966' that overlooked a 2016–17 episode.
Guardrail shipped Label confirmed-case states, distribution states, deaths, and hospitalizations separately; the headline count must match one specific labelled figure, never a blend. Any 'first / since <year> / never' claim needs a source for the absence with explicit scope (e.g. 'mainland U.S.'), or the superlative is dropped.
watching Number-mismatch and temporal corrections on the Public Health desk should trend to zero.
The corrections that triggered it: Listeria 'nine-state' headline · Screwworm 'first since 1966'
-
Carry the population qualifier into the headline
What kept breaking A headline generalised a subgroup finding — 'adults stay protected' — beyond the ancestral-imprinted adults the source described.
Guardrail shipped A subgroup or conditional finding may not be generalised in the headline or dek; the source's population qualifier travels with it ('ancestral-imprinted adults', not 'adults'). Preprints stay labelled not-peer-reviewed, in the headline too.
watching Overstatement and missing-caveat corrections on the Research desk should trend to zero.
The corrections that triggered it: BA.3.2.2 'adults stay protected'
-
Match the organism to the right '-emia'
What kept breaking A fungus (Candida) was grouped under 'bacteremic pathogens'.
Guardrail shipped Match each organism to its correct class and term — bacteria → bacteremia, fungi (e.g. Candida) → candidemia/fungemia, viruses → viremia — and name the exact FDA pathway (510(k) / PMA / De Novo), which are not interchangeable.
watching Entity-class corrections on the Devices & Diagnostics desk should trend to zero.
The corrections that triggered it: Candida grouped as bacteremia
-
Benchmark precision: 'listed on' is not 'tops'
What kept breaking A model that appears on a leaderboard was called 'leaderboard-topping' when it merely ranks on it.
Guardrail shipped Make no rank claim without the actual leaderboard position — 'listed on' is not 'tops'. State the validation population and whether a result is prospective or retrospective.
watching Overstatement corrections on the Digital Health & AI desk should trend to zero.
The corrections that triggered it: GPT-5-Chat 'leaderboard-topping'
How the number is made
This is not the same as the fact-check that happens while a story is written. After each edition is published, a separate panel of agents — with no stake in the story — independently re-opens every cited primary source (the trial registry record, the regulator's document, the journal's results table) and tests each load-bearing claim against it. It is the same kind of AI newsroom, not a third party — but a separate pass, run after publication, with no stake in the story. A claim only counts as confirmed if the source actually says what we printed.
When the panel finds a discrepancy, we correct it in the open: the fix is logged on the corrections page, and the running accuracy figure above moves with it — up or down. We would rather show a real number than a flattering one. The same standards that govern the newsroom are described in full in our editorial standards.
“Accuracy” here means the share of re-checked claims that matched their primary source, counting only material (critical or major) discrepancies. Minor wording notes are reviewed but not counted against the figure. The verification log is append-only and tamper-evident, like the edition archive.