Potassium chloride is one of the drugs used in lethal injection. A dose that is off by a few milliequivalents, or delivered too fast, can stop a heart. That is the unforgiving margin a new medRxiv preprint used to probe whether a leaderboard-listed large language model (one that appears on the MedAgentBench leaderboard, though not at the top of it) can safely handle a task that floods every acute-care unit: electrolyte replacement.

The answer, for now, is no — and the model does not seem to know it.

A team with Andrea Sikora (University of Colorado School of Medicine) as senior author built 20 clinician-annotated hypokalemia cases reflecting real-world complexity, well beyond the single-rule potassium task in MedAgentBench. They tested GPT-5-Chat on each case in triplicate, with and without a clinician-curated dosing guideline, scoring six dimensions: potassium goals, dose, route, lab frequency, concurrent interventions, and the model’s own ratings of its confidence and of case complexity.

The guideline helped, but not enough

Handed the dosing guideline, GPT-5-Chat’s average accuracy rose from 45% to 65%, and total errors fell from 165 to 104. Concurrent interventions and dosing drew the most errors in both arms. Potential-harm scores stayed “considerable” throughout, though severity eased when the guidance document was supplied.
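
For readers who want to check the size of that improvement, the arithmetic can be sketched in a few lines of Python. This is only an illustration built on the four figures reported above; the function names are hypothetical, and it assumes both arms were scored on the same set of responses, so that the two error totals are directly comparable.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.25 means a 25% drop)."""
    if before == 0:
        raise ValueError("cannot compute a relative reduction from zero")
    return (before - after) / before


# Figures as reported in the article: accuracy without vs. with the guideline,
# and total errors without vs. with the guideline.
accuracy_gain = percentage_point_change(45, 65)
error_drop = relative_reduction(165, 104)

print(f"Accuracy gain: {accuracy_gain:.0f} percentage points")
print(f"Relative drop in total errors: {error_drop:.1%}")
```

On those numbers, the guideline added 20 percentage points of accuracy and removed a little over a third of the errors, which still leaves roughly one response in three wrong.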

The unsettling part is metacognition. GPT-5-Chat reported high confidence on 100% of responses — including the wrong ones — while flagging 80% of cases as highly complex with the guideline and 76% without it. It recognized difficulty and asserted certainty anyway.

Accuracy topped out at 65% with the rulebook in hand — yet the model voiced high confidence on every single answer.

For grounding, 54 clinicians reviewed the cases; they “highly” or “somewhat” agreed with the guideline-recommended management only 66.8% of the time, underscoring genuine practice variability.

The authors’ conclusion is a warning to benchmark-builders: single-rule leaderboards like the MedAgentBench potassium item overstate readiness. This is a preprint, not yet peer reviewed, and it tests one model on one electrolyte — but the safety signal is clear.

Correction (6 June 2026): An earlier headline and lede called GPT-5-Chat “leaderboard-topping.” The preprint describes it as a model that appears on the MedAgentBench leaderboard, not one that tops it — that benchmark is in fact led by other models. The wording has been changed to “leaderboard-listed.” Flagged by The Vital Record’s independent verification pass.