General-purpose large language models (LLMs) outperformed two commercial clinical AI tools across every benchmark stage tested, according to a peer-reviewed study published June 12, 2026, in Nature Medicine (PMID 42286322).
Researchers from NYU Langone Health and The University of Texas at Austin evaluated three frontier LLMs — GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 — against two specialized clinical AI products: OpenEvidence and UpToDate Expert AI. The evaluation comprised three sequential stages:
- MedQA: 500 standardized medical knowledge questions
- HealthBench: 500 items measuring alignment with clinician expectations
- Real Clinical Queries (RCQ): 100 de-identified queries that physicians submitted to a general-purpose LLM in a live clinical setting. Twelve U.S. clinicians rated the responses in a randomized, blinded review, producing 1,800 model-question annotations
Frontier LLMs outperformed the specialized clinical AI tools across all three evaluation stages. On the RCQ benchmark, the evaluation most grounded in actual clinical use, the specialized tools performed comparably to Google Search AI Overview, a general-purpose consumer AI product.
Important limitations: This was a benchmark-based, retrospective evaluation, not a prospective study of patient-care outcomes. The study does not establish whether using frontier LLMs in clinical settings improves or worsens patient outcomes. The authors call for independent, real-world evaluation of AI tools before clinical adoption. The results also do not address the security, regulatory compliance, or workflow integration requirements that clinical tools may be designed to meet.