Digital pathology foundation models are large AI systems trained on millions of digitized tissue slides. Developers position them for high-stakes tasks such as cancer detection, tumor classification, and treatment-response assessment, analyses that currently require specialized human pathologists. A study published in Nature Communications tested whether these models hold up when slides come from institutions other than those represented in the training data.
The benchmark, called PathoROB, assembled data spanning 34 medical centers and 28 distinct biological classes across four datasets. Researchers evaluated 23 publicly available foundation models on three metrics: a Robustness Index (whether biological signal dominates institutional noise in a model’s internal representations), an Average Performance Drop (how much downstream task accuracy degrades when a model encounters out-of-distribution institutional features), and a Clustering Score (whether unsupervised groupings reflect biology rather than hospital-specific artifacts).
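To make the idea behind a robustness-style score concrete, here is a minimal sketch in Python. It is not the benchmark's actual implementation, and this article does not give the exact formula. The function name `robustness_index_sketch`, the neighbor count `k`, and the toy data are all assumptions made for illustration. The sketch looks at each sample's nearest neighbors in embedding space and asks how often they share the sample's biological class but come from another center, versus how often they share the center but not the biology.

```python
import numpy as np


def robustness_index_sketch(embeddings, bio_labels, center_labels, k=3):
    """Illustrative nearest-neighbor ratio (hypothetical, not PathoROB's code).

    Counts, over each sample's k nearest neighbors:
      SO = same biological class, other medical center
      OS = other biological class, same medical center
    and returns SO / (SO + OS). Values near 1 suggest biology dominates
    the embedding space; values near 0 suggest the center dominates.
    """
    X = np.asarray(embeddings, dtype=float)
    bio = np.asarray(bio_labels)
    center = np.asarray(center_labels)

    # Pairwise squared Euclidean distances between all samples.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest neighbors of every sample.
    nn = np.argsort(dist, axis=1)[:, :k]

    same_bio = bio[nn] == bio[:, None]
    same_center = center[nn] == center[:, None]

    so = int(np.sum(same_bio & ~same_center))
    os_ = int(np.sum(~same_bio & same_center))
    if so + os_ == 0:
        return float("nan")  # no informative neighbors
    return so / (so + os_)


# Toy data: 2 biological classes x 2 centers, 2 samples per combination.
bio = [0, 0, 0, 0, 1, 1, 1, 1]
center = [0, 0, 1, 1, 0, 0, 1, 1]

# Embedding where biology separates samples far more than the center does.
bio_driven = [[0.0], [0.1], [1.0], [1.1], [10.0], [10.1], [11.0], [11.1]]
# Embedding where the center separates samples far more than biology does.
center_driven = [[0.0], [0.1], [10.0], [10.1], [1.0], [1.1], [11.0], [11.1]]

print(robustness_index_sketch(bio_driven, bio, center))     # 1.0
print(robustness_index_sketch(center_driven, bio, center))  # 0.0
```

In the first toy embedding, every neighbor shares the sample's biological class, so the ratio is 1.0. In the second, neighbors share the hospital instead, so it falls to 0.0. Real model embeddings would land somewhere in between.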
The problem is driven by “non-biological artifacts”: institution-level technical differences in staining protocols, imaging equipment, and slide-processing techniques that vary across the 34 centers without reflecting any underlying difference in patient biology.
No model scored at the ceiling across all three metrics. On the Robustness Index, scores ranged from 0.928 for the top-ranked model to 0.446 for the lowest-ranked, roughly a twofold gap between best and worst. The evaluation was retrospective: existing models were tested against archived multi-center data, not prospectively validated in live clinical settings.
The authors make the PathoROB benchmark and a public leaderboard available at github.com/bifold-pathomics/PathoROB, enabling developers and hospital procurement teams to run candidate models through the same evaluation before adoption.
The study does not conclude that current tools are unsafe. It concludes that robustness varies substantially across models and that it can be measured, and measurement is a prerequisite for managing robustness before clinical deployment.