The pitch for specialized medical AI is intuitive: a model trained only on curated, peer-reviewed literature should outperform a general-purpose chatbot trained on the open internet. A blinded randomized trial published in Neurosurgery tested that premise head-on, and the result is a useful cold shower for the domain-specific thesis.

Researchers at NYU Langone Health built CNS-Obsidian, a vision-language model fine-tuned from a 34-billion-parameter open model on 23,984 neurosurgical journal articles, which yielded 78,853 figures and captions and 263,064 training samples. They then ran it against a HIPAA-compliant GPT-4o endpoint as a diagnostic copilot, with neurosurgeons blinded and randomized to one model or the other after patient consultations between August and November 2024.

The headline finding: the specialist did not win. On the trial’s primary endpoints, CNS-Obsidian drew positive helpfulness ratings in 40.62% of cases versus 57.89% for GPT-4o (P = .230), and both models included the correct diagnosis in roughly 60% of cases (59.38% vs 65.79%, P = .626). Neither difference was statistically significant, but both point estimates ran in GPT-4o’s favor, not the home-grown model’s.
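For readers who want to see how comparisons like these are tested, here is a minimal sketch in Python. It rests on two assumptions that the figures above do not confirm. First, the counts (13 of 32 and 22 of 38 rated helpful, 19 of 32 and 25 of 38 with the correct diagnosis) are back-calculated from the reported percentages and arm sizes. Second, Fisher's exact test is used as a stand-in, since the test the authors ran is not stated here, so the printed p-values need not match the reported ones.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "successes" in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# ASSUMPTION: counts reconstructed from the reported percentages (not given directly).
ARMS = {"CNS-Obsidian": 32, "GPT-4o": 38}
HELPFUL = {"CNS-Obsidian": 13, "GPT-4o": 22}
CORRECT_DX = {"CNS-Obsidian": 19, "GPT-4o": 25}


def compare(successes):
    """Return per-arm rates (in percent) and the two-sided Fisher p-value."""
    rates = {arm: 100 * successes[arm] / n for arm, n in ARMS.items()}
    a, c = successes["CNS-Obsidian"], successes["GPT-4o"]
    b, d = ARMS["CNS-Obsidian"] - a, ARMS["GPT-4o"] - c
    return rates, fisher_exact_two_sided(a, b, c, d)


if __name__ == "__main__":
    for label, counts in (("helpful", HELPFUL), ("correct diagnosis", CORRECT_DX)):
        rates, p = compare(counts)
        print(label, {k: round(v, 2) for k, v in rates.items()}, f"p = {p:.3f}")
```

The reconstructed counts reproduce the reported percentages to within rounding, which is the main check this sketch offers.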

The interface, not the weights, may be the bottleneck

The more striking number is engagement. Of 959 total consultations during the trial window, clinicians invoked the copilot in just 70 — a 7.3% utilization rate — leaving only 32 CNS-Obsidian and 38 GPT-4o cases to evaluate. A tool that surgeons reach for in fewer than one in thirteen encounters is not yet part of the workflow, whichever model sits behind it.
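The engagement arithmetic is easy to check. The short Python sketch below uses only the counts quoted above; the variable names are illustrative and do not come from the study.

```python
# Counts as reported for the trial window.
total_consultations = 959
copilot_uses = 70
arm_cases = {"CNS-Obsidian": 32, "GPT-4o": 38}

# The two arms should account for every copilot invocation.
assert sum(arm_cases.values()) == copilot_uses

utilization = copilot_uses / total_consultations   # share of consultations using the copilot
one_in_n = total_consultations / copilot_uses      # "one in N" framing of the same ratio

print(f"utilization: {utilization:.1%}")
print(f"about one in {one_in_n:.1f} consultations")
```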

Low clinical utilization suggests chatbot interfaces may not align with specialist workflows.

The benchmark data complicate the story further. CNS-Obsidian essentially matched GPT-4o on synthetic, model-generated questions (76.13% vs 77.54%, P = .235) but collapsed on human-written ones (46.81% vs 65.70%, P < 10⁻¹⁵) — a gap suggesting the specialist learned to answer questions shaped like its own training data, not the messier ones clinicians actually ask.

The authors’ framing is measured: a far smaller, cheaper model can approach frontier performance in a narrow domain, and the training pipeline offers a transparent template for other specialties. That is a real contribution. But near-parity on a benchmark is not clinical impact, and on this evidence the case for swapping a frontier model out for a bespoke one remains unproven, as does the deeper case that a chatbot is the right interface at all.