A specialist neurosurgery AI couldn't beat GPT-4o, and surgeons rarely used either

In a blinded randomized trial at NYU Langone, a purpose-built vision-language model matched but did not surpass a general frontier model, and clinicians reached for the copilot in only 7.3% of consults.

The pitch for specialized medical AI is intuitive: a model trained only on curated, peer-reviewed literature should outperform a general-purpose chatbot trained on the open internet. A blinded randomized trial published in Neurosurgery tested that premise head-on, and the result is a useful cold shower for the domain-specific thesis.

Researchers at NYU Langone Health built CNS-Obsidian, a vision-language model fine-tuned from a 34-billion-parameter open model on 23,984 neurosurgical journal articles, which yielded 78,853 figures and captions and 263,064 training samples. They then ran it against a HIPAA-compliant GPT-4o endpoint as a diagnostic copilot, with neurosurgeons blinded and randomized to one model or the other after patient consultations between August and November 2024.

The headline finding: the specialist did not win. On the trial’s primary endpoints, CNS-Obsidian drew positive helpfulness ratings in 40.62% of cases versus 57.89% for GPT-4o (P = .230), and both models included the correct diagnosis in roughly 60% of cases (59.38% vs 65.79%, P = .626). Neither difference was statistically significant, and both point estimates favored GPT-4o over the home-grown model.
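For readers who want to check the arithmetic, here is a minimal sketch of how such a comparison can be tested. It rests on two assumptions that are not stated in the article. First, the raw counts are back-calculated from the reported percentages and the 32 and 38 evaluated cases per arm: 13 of 32 and 22 of 38 positive helpfulness ratings, 19 of 32 and 25 of 38 correct diagnoses. Second, the test used here is a two-sided Fisher's exact test, chosen for illustration because the groups are small; the article does not name the authors' method. The helper `fisher_exact_two_sided` is a hypothetical name written for this sketch.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probability of every table at least as unlikely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )


# ASSUMED counts (successes, cases), back-calculated from the percentages.
helpful = {"CNS-Obsidian": (13, 32), "GPT-4o": (22, 38)}
correct_dx = {"CNS-Obsidian": (19, 32), "GPT-4o": (25, 38)}


def compare(arms):
    """Run the test on CNS-Obsidian versus GPT-4o for one endpoint."""
    (k1, n1), (k2, n2) = arms["CNS-Obsidian"], arms["GPT-4o"]
    return fisher_exact_two_sided(k1, n1 - k1, k2, n2 - k2)


p_helpful = compare(helpful)
p_correct_dx = compare(correct_dx)
print(f"helpfulness: p = {p_helpful:.3f}")
print(f"correct diagnosis: p = {p_correct_dx:.3f}")
```

If the assumed counts are right, both p-values come out well above .05, which fits the reading that neither difference is statistically significant. With arms this small, a gap of 17 percentage points can easily arise by chance.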

The interface, not the weights, may be the bottleneck

The more striking number is engagement. Of 959 total consultations during the trial window, clinicians invoked the copilot in just 70 — a 7.3% utilization rate — leaving only 32 CNS-Obsidian and 38 GPT-4o cases to evaluate. A tool that surgeons reach for in fewer than one in thirteen encounters is not yet part of the workflow, whichever model sits behind it.
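The utilization figure can be reproduced, and given a rough margin of error, in a few lines. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. It is our illustration, not something reported by the trial, and it assumes each consultation counts as an independent opportunity to use the copilot. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


consults, copilot_uses = 959, 70  # figures from the article
rate = copilot_uses / consults
low, high = wilson_interval(copilot_uses, consults)

print(f"utilization: {rate:.1%} (approx. 95% CI {low:.1%} to {high:.1%})")
print(f"about one use per {consults / copilot_uses:.1f} consultations")
```

Even the upper end of that interval stays below one consultation in ten, so the low-engagement reading does not depend on the exact point estimate.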

Low clinical utilization suggests chatbot interfaces may not align with specialist workflows.

The benchmark data complicate the story further. CNS-Obsidian essentially matched GPT-4o on synthetic, model-generated questions (76.13% vs 77.54%, P = .235) but collapsed on human-written ones (46.81% vs 65.70%, P < 10⁻¹⁵) — a gap suggesting the specialist learned to answer questions shaped like its own training data, not the messier ones clinicians actually ask.

The authors’ framing is measured: a far smaller, cheaper model can approach frontier performance in a narrow domain, and the training pipeline offers a transparent template for other specialties. That is a real contribution. But near-parity with a frontier model is not clinical impact. On this evidence, the case for swapping a frontier model out for a bespoke one remains unproven, and so does the deeper case that a chatbot is the right interface at all.

Trials Today

Phase 3 TROPiCS-04 (sacituzumab govitecan, urothelial cancer) — Negative confirmatory readout: primary OS endpoint missed, HR 0.86 (95% CI 0.73-1.02), p=0.0870; results posted to registry.

Phase 3 TRIUMPH-1 (retatrutide, obesity) — Status changed to COMPLETED (last update 2026-06-03); enrollment 2,335; no efficacy results posted yet (hasResults false).

Phase 3 LoTam (low-dose tamoxifen, early breast cancer) — SUSPENDED for protocol amendment to increase sample size (NCI/Alliance); reason stated as operational, not safety.

Phase 3 RASolute 305 (zoldonrasib, KRAS G12D pancreatic cancer) — Newly registered and recruiting first-line trial; co-primary PFS and OS; planned enrollment 670.

Phase 3 Sibeprenlimab (IgA nephropathy) — Status COMPLETED (last update 2026-06-03; primary completion 2026-05-13); full efficacy results not yet posted.

At the Agencies

Zaynich (cefepime/zidebactam) approved for complicated UTI — Novel beta-lactam/beta-lactamase-inhibitor combo, NDA 220787; first new chemical entity fully developed by an Indian company to win FDA approval.

Cingulate CTx-1301 (ADHD) receives Complete Response Letter — Per company, CRL cited only CMC/manufacturing deficiencies with no current safety or efficacy concerns; resubmission planned.

FDA draft guidance on platform knowledge for genome-editing therapies — Non-binding draft (June 2, 2026) on reusing platform CMC, nonclinical and clinical data; comment period before finalization.