General-purpose large language models (LLMs) outperformed two commercial clinical AI tools across every benchmark stage tested, according to a peer-reviewed study published June 12, 2026, in Nature Medicine (PMID 42286322).

Researchers from NYU Langone Health and The University of Texas at Austin evaluated three frontier LLMs — GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 — against two specialized clinical AI products: OpenEvidence and UpToDate Expert AI. The evaluation comprised three sequential stages:

  1. MedQA — 500 standardized medical knowledge questions
  2. HealthBench — 500 items measuring alignment with clinician expectations
  3. Real Clinical Queries (RCQ) — 100 de-identified queries that physicians submitted to a general-purpose LLM in a live clinical setting. Twelve U.S. clinicians rated the responses in a randomized, blinded review, producing 1,800 model-question annotations.

Frontier LLMs outperformed the specialized clinical AI tools across all three evaluation stages. On the RCQ benchmark, the evaluation most grounded in actual clinical use, the specialized tools performed comparably to Google Search AI Overview, a general-purpose consumer AI product.

Important limitations: This was a retrospective, benchmark-based evaluation, not a prospective study of patient-care outcomes. The study does not establish whether using frontier LLMs in clinical settings benefits or harms patients. The authors call for independent, real-world evaluation of AI tools before clinical adoption. The results also do not address the security, regulatory compliance, or workflow integration requirements that clinical tools may be designed to meet.