General-Purpose AI Outperforms Specialized Clinical Tools in Head-to-Head Benchmark

A Nature Medicine study using three rigorous evaluations — including 100 real physician queries rated by 12 clinicians — found frontier large language models consistently outperformed purpose-built clinical AI products from OpenEvidence and UpToDate. The findings do not assess patient outcomes.

General-purpose large language models (LLMs) outperformed two commercial clinical AI tools across every benchmark stage tested, according to a peer-reviewed study published June 12, 2026, in Nature Medicine (PMID 42286322).

Researchers from NYU Langone Health and The University of Texas at Austin evaluated three frontier LLMs — GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 — against two specialized clinical AI products: OpenEvidence and UpToDate Expert AI. The evaluation comprised three sequential stages:

MedQA — 500 standardized medical knowledge questions
HealthBench — 500 items measuring alignment with clinician expectations
Real Clinical Queries (RCQ) — 100 de-identified queries that physicians submitted to a general-purpose LLM in a live clinical setting, rated by 12 U.S. clinicians in a randomized, blinded review that produced 1,800 model-question annotations

Frontier LLMs outperformed the specialized clinical AI tools across all three evaluation stages. On the RCQ benchmark — the evaluation most grounded in actual clinical use — the specialized tools performed comparably to Google Search AI Overview, a general consumer AI product.

Important limitations: This was a benchmark-based, retrospective evaluation, not a prospective study of patient-care outcomes. The study does not establish whether using frontier LLMs in clinical settings improves or harms patients. The authors call for independent, real-world evaluation of AI tools before clinical adoption. Results also do not address security, regulatory compliance, or workflow integration requirements that clinical tools may be designed to meet.

Trace · every claim, sourced

Reported and written by Owen Tanaka, Digital Health & AI Desk. Each load-bearing claim was then bound to the primary source it rests on and checked out-of-band against that source before publication. The full mapping is below; nothing here is taken on faith.

After publication, a separate AI panel re-verifies every edition against these same sources. The running claim-confirmation rate and every correction are public on the accuracy ledger.

Trials Today

Phase 3 C-PRE (NCT06568172) — NCI cemiplimab perioperative CSCC trial — now suspended after 16 months

Phase 2b/3 NAAVIGATE (NCT07592273) — AbbVie surabgene lomparvovec gene therapy for diabetic retinopathy; n=576

Phase 3 REPLACE-CV (NCT05605964) — Cardiovascular outcomes trial

Phase 3 EASi-HF (NCT06424288) — Heart failure management trial

At the Agencies

Ozekibart BLA accepted — First-ever BLA for chondrosarcoma; PDUFA April 14, 2027

Colorado Section 804 SIP authorized — Second state approved to import Rx drugs from Canada; no shipments yet

Ixchiq (chikungunya vaccine) restriction — FDA safety communication on use in immunocompromised adults

Valproate paternal-exposure guidance — EMA recommends updated pregnancy warnings for male patients on valproate