AI Pathology Models Pass Lab Tests But Stumble Across Hospitals, New Benchmark Finds

PathoROB evaluated 23 foundation models on data from 34 medical centers and found that none was consistently robust on all three measures, a result the authors say argues for robustness testing before clinical adoption.

Digital pathology foundation models are large AI systems trained on millions of digitized tissue slides. Developers position them for high-stakes tasks such as cancer detection, tumor classification, and treatment-response assessment, analyses that currently require specialized human pathologists. A study published in Nature Communications tested whether these models hold up when slides come from institutions other than those represented in their training data.

The benchmark, called PathoROB, assembled data spanning 34 medical centers and 28 distinct biological classes across four datasets. Researchers evaluated 23 publicly available foundation models against three metrics: a Robustness Index (whether biological signal dominates over institutional noise in a model’s internal representations), an Average Performance Drop (how much downstream task accuracy degrades when a model encounters out-of-distribution institutional features), and a Clustering Score (whether unsupervised groupings reflect biology rather than hospital-specific artifacts).
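To make the idea behind a robustness index concrete, here is a minimal sketch of one way such a score could be computed from model embeddings. It is an illustrative nearest-neighbor formulation, not necessarily PathoROB's exact formula: the function name `knn_robustness_index`, the Euclidean distance, and the choice of `k` are all assumptions made for this example. For each slide embedding, it looks at the nearest neighbors and asks how often a neighbor shares the biological class but comes from a different center, versus sharing the center but not the biology.

```python
import numpy as np

def knn_robustness_index(embeddings, bio_labels, center_labels, k=3):
    """Illustrative score in [0, 1]: 1.0 means neighbors agree on biology,
    0.0 means neighbors agree on medical center instead.
    Hypothetical helper, not the benchmark's reference implementation."""
    X = np.asarray(embeddings, dtype=float)
    bio = np.asarray(bio_labels)
    center = np.asarray(center_labels)

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    same_bio = bio[neighbors] == bio[:, None]
    same_center = center[neighbors] == center[:, None]

    # Only neighbors that match on exactly one attribute are informative.
    bio_hits = np.sum(same_bio & ~same_center)     # same class, other center
    center_hits = np.sum(~same_bio & same_center)  # other class, same center
    total = bio_hits + center_hits
    return float(bio_hits / total) if total else float("nan")

# Toy data: four embeddings forming two tight pairs.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]]

# Pairs share a biological class but come from different centers.
print(knn_robustness_index(points, ["A", "A", "B", "B"], [1, 2, 1, 2], k=1))  # 1.0

# Pairs share a center but differ in biological class.
print(knn_robustness_index(points, ["A", "B", "A", "B"], [1, 1, 2, 2], k=1))  # 0.0
```

In this toy setup, a score near 1 would suggest the embedding space is organized by biology, while a score near 0 would suggest it is organized by which hospital produced the slide.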

The problem is driven by non-biological artifacts: institution-level technical differences in staining protocols, imaging equipment, and slide-processing techniques that vary across the 34 centers without reflecting any underlying change in patient biology.

No model scored at the ceiling across all three metrics. On the Robustness Index, scores ranged from 0.928 for the top-ranked model to 0.446 for the lowest — roughly a twofold gap between best and worst. The evaluation was retrospective: existing models were tested against archived multi-center data, not prospectively validated in live clinical settings.

The authors make the PathoROB benchmark and a public leaderboard available at github.com/bifold-pathomics/PathoROB, enabling developers and hospital procurement teams to run candidate models through the same evaluation before adoption.

The study does not conclude that current tools are unsafe — it concludes that robustness varies substantially and is measurable, which is a prerequisite for managing it before clinical deployment.

Trials Today

Phase 3 ZENITH / zilebesiran (siRNA antihypertensive) — 11,000-patient CV outcomes trial; composite endpoint of CV death, MI, stroke, HF hospitalization over ~5 years; twice-yearly dosing investigational.

Phase 3 COMBINE 2 / IcoSema (insulin icodec + semaglutide) — 683-patient trial in T2DM inadequately controlled on GLP-1 RA; 52-week results posted with HbA1c change as primary endpoint.

Phase 3 CONVOKE / CT-155 digital therapeutic for schizophrenia — 464-patient trial for negative symptoms of schizophrenia (Click Therapeutics/Boehringer Ingelheim); primary endpoint completed June 2025, results pending.

Phase 3 PRIOH-1 / pritelivir for acyclovir-resistant HSV — 158-patient trial in immunocompromised patients; novel helicase-primase inhibitor; completed November 2025, results imminent.

Phase 3 PREGnant / elagolix pre-IVF in endometriosis — NIH/NICHD-funded, 103-patient trial at Yale; live birth rate primary endpoint; completed May 2025, results posted.

At the Agencies

BD ChloraPrep / FREPP — Nationwide Voluntary Recall — Aspergillus penicillioides contamination in specific lots of ChloraPrep 1 mL and FREPP 1.5 mL chlorhexidine skin-prep applicators; risk of sepsis and death in surgical patients; distributed to hospitals March-June 2024.

J&J Cerenovus CEREPAK — Class I Recall — Higher-than-expected failure-to-detach rate linked to four serious injuries and one death; risks include hemorrhagic/ischemic stroke; ~12,000 units across 11 product lines.

GE HealthCare MIM Contour ProtegéAI+ 2.0 — FDA 510(k) Clearance — First radiation oncology AI software cleared with a Predetermined Change Control Plan (PCCP), allowing model updates without new 510(k) submissions; cleared June 4, 2026.