A leaderboard-listed LLM still botches potassium dosing in a preprint stress-test — while claiming full confidence

A medRxiv preprint stress-tested GPT-5-Chat on 20 clinician-built potassium cases; accuracy peaked at 65% with the guideline in hand, yet the model claimed high confidence on 100% of answers.

Owen Tanaka, Digital Health & AI Desk · Saturday, 6 June 2026 · 2 min read · Preprint (not peer reviewed)

Potassium chloride is one of the drugs used in lethal injection. A few milliequivalents the wrong way, delivered too fast, can stop a heart. That is the unforgiving margin a new medRxiv preprint used to probe whether a leaderboard-listed large language model — one that features on the MedAgentBench benchmark, though not at the top of it — can safely handle a task that floods every acute-care unit: electrolyte replacement.

The answer, for now, is no — and the model does not seem to know it.

A team with Andrea Sikora (University of Colorado School of Medicine) as senior author built 20 clinician-annotated hypokalemia cases reflecting real-world complexity, well beyond the single-rule potassium task in the MedAgentBench benchmark. They ran GPT-5-Chat on each case in triplicate, once with and once without a clinician-curated dosing guideline, and scored six dimensions: potassium goals, dose, route, lab frequency, concurrent interventions, and the model’s self-assessment (its stated confidence together with its rating of case complexity).

The guideline helped, but not enough

Handed the dosing guideline, GPT-5-Chat’s average accuracy rose from 45% to 65%, and total errors fell from 165 to 104. Concurrent interventions and dosing drew the most errors in both arms. Potential-harm scores stayed “considerable” throughout, though severity eased when the guidance document was supplied.
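For readers who want the arithmetic behind those figures, here is a minimal Python sketch. It uses only the numbers reported above; the variable names are illustrative, and the per-response rate assumes that all 20 cases × 3 runs were scored in each arm and that the error totals are counted per arm, which the article does not spell out.

```python
# Figures as reported in this article; names are illustrative, not from the preprint.
CASES = 20
REPLICATES = 3

ACCURACY_PCT = {"without_guideline": 45, "with_guideline": 65}
TOTAL_ERRORS = {"without_guideline": 165, "with_guideline": 104}


def summarize():
    # Assumption: every case/run pair was scored, giving 60 responses per arm.
    responses_per_arm = CASES * REPLICATES
    errors_avoided = TOTAL_ERRORS["without_guideline"] - TOTAL_ERRORS["with_guideline"]
    return {
        "responses_per_arm": responses_per_arm,
        # Absolute gain in percentage points, not a relative change.
        "accuracy_gain_points": ACCURACY_PCT["with_guideline"] - ACCURACY_PCT["without_guideline"],
        "errors_avoided": errors_avoided,
        "relative_error_drop_pct": round(100 * errors_avoided / TOTAL_ERRORS["without_guideline"], 1),
        # Assumption: error totals are per arm, spread over all responses in that arm.
        "errors_per_response": {
            arm: round(count / responses_per_arm, 2) for arm, count in TOTAL_ERRORS.items()
        },
    }


print(summarize())
```

On those assumptions, the guideline bought a 20-point accuracy gain and 61 fewer errors, a drop of about 37%, but still left well over one error per response on average.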

The unsettling part is metacognition. GPT-5-Chat reported high confidence on 100% of responses — including the wrong ones — while flagging 80% of cases as highly complex with the guideline and 76% without it. It recognized difficulty and asserted certainty anyway.
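One rough way to size that mismatch is to subtract average accuracy from the share of answers given with high confidence. The sketch below does this with the article's figures (65% and 45% accuracy, 100% high confidence). This is a simplification for illustration, not a calibration metric the preprint is reported to use, and the function name is hypothetical.

```python
def overconfidence_gap(high_confidence_pct, accuracy_pct):
    """Percentage points by which the high-confidence rate exceeds accuracy."""
    return high_confidence_pct - accuracy_pct


# Figures as reported in this article; the gap itself is an illustrative measure.
gaps = {
    "with_guideline": overconfidence_gap(100, 65),
    "without_guideline": overconfidence_gap(100, 45),
}
print(gaps)
```

By this crude measure the gap is 35 points with the guideline and 55 without it. A well-calibrated system would show a gap near zero.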

Accuracy topped out at 65% with the rulebook in hand — yet the model voiced high confidence on every single answer.

For grounding, 54 clinicians reviewed the cases; they “highly” or “somewhat” agreed with the guideline-recommended management only 66.8% of the time, underscoring genuine practice variability.

The authors’ conclusion is a warning to benchmark-builders: single-rule leaderboards like the MedAgentBench potassium item overstate readiness. This is a preprint, not yet peer reviewed, and it tests one model on one electrolyte — but the safety signal is clear.

Correction (6 June 2026): An earlier headline and lede called GPT-5-Chat “leaderboard-topping.” The preprint describes it as a model that appears on the MedAgentBench leaderboard, not one that tops it — that benchmark is in fact led by other models. The wording has been changed to “leaderboard-listed.” Flagged by The Vital Record’s independent verification pass.

Trials Today

Phase 3 EPIK-O (alpelisib + olaparib vs chemotherapy, platinum-resistant ovarian cancer) — Terminated; results posted. Primary PFS negative — median 3.6 vs 3.9 mo, HR 1.142 (95% CI 0.882-1.478). n=358.

Phase 3 LIBREXIA-ACS (milvexian, oral Factor XIa inhibitor, post-ACS) — Status updated to Completed (primary completion 2026-02-06); n=14,194. No efficacy/safety results posted yet — readout pending.

Phase 3 VALOR (VLA15 6-valent OspA Lyme disease vaccine, Pfizer/Valneva) — Status updated to Completed (primary completion 2026-01-07); n=12,546, age 5+. No efficacy results posted yet.

At the Agencies

Salanersen (Biogen) — FDA Breakthrough Therapy Designation, spinal muscular atrophy — Granted June 4, 2026. Once-yearly intrathecal antisense oligonucleotide. Designation only — not a filing acceptance or approval.

Etcamah (camizestrant, AstraZeneca) — CHMP positive opinion, ESR1-mutant HR+/HER2- breast cancer — Among 8 new-medicine positive opinions at the May 18-21, 2026 CHMP meeting; one refusal (Deqtynet). CHMP recommendation, not final EC approval.