Conversational diagnostic AI has lived almost entirely in simulation, scored against vignettes and actors. A preprint from Google Research, Google DeepMind and Beth Israel Deaconess Medical Center (BIDMC) moves it into a clinic with real patients — cautiously, and with a human physician watching every word.
In this prospective, single-arm feasibility study (NCT06911398), 100 adults scheduled for non-emergency urgent-care visits at a leading academic medical center completed a pre-visit text chat with AMIE — the Articulate Medical Intelligence Explorer — up to five days before an in-person or telehealth appointment. AMIE took the history and generated a differential diagnosis and a transcript for the treating clinician.
What the study set out to measure
The pre-registered primary outcomes were not accuracy. They were safety and feasibility (the number and type of chat terminations), the quality of AMIE’s clinical dialogue, and the experiences of patients and physicians. Diagnostic accuracy and the head-to-head comparison against doctors were secondary outcomes — and the authors caution that the single-arm design “offers challenges to meaningfully evaluate” them.
On the primary safety endpoint, the result was clean: across all interactions, the physician “AI supervisors” — a panel of board-certified internists watching each chat live via secure video with screen-sharing — triggered zero of the four pre-specified stop criteria. That is the finding the study was built to produce.
It was not, however, fully hands-off. The paper reports the supervisor stepped in on three occasions: once to clarify symptoms in order to rule out a potentially emergent condition the patient did not have, once to clarify when to seek emergency care, and once to correct an AMIE error — the model stated that a patient’s past surgery date was in the future. So no consultation had to be halted, but a human did intervene, including to fix a hallucination.
The study was designed to answer whether a diagnostic chatbot can be run safely with real patients under supervision. On that question it returned a yes — with the human supervisor still doing real work.
What the differential caught, and what it didn’t
Accuracy was scored against a final diagnosis set by a blinded panel of three internists via chart review eight weeks after the visit. These figures cover the 98 patients with a confirmed final diagnosis, not the full 100, and they depend heavily on how many guesses you allow AMIE.
AMIE’s single leading diagnosis matched the final answer in 55 of 98 cases (56%, top-1). Widen the net to its first three candidates and that rises to 73 of 98 (roughly 75%, top-3); allow the first seven candidates of its ranked list and the correct diagnosis appeared in 88 of 98 (90%, top-7). The 90% figure, in other words, is a top-7 number, not “the differential was right nine times in ten.”
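To make the top-k arithmetic concrete, here is a minimal Python sketch that recomputes the percentages from the counts quoted above. The function names (`top_k_hit`, `top_k_rates`) are illustrative, and the exact string matching is a simplification assumed here; the study's own scoring relied on a blinded physician panel, not on code like this.

```python
# Counts as quoted in this article; the denominator is the 98 patients
# with a confirmed final diagnosis, not the full 100 enrolled.
REPORTED_HITS = {1: 55, 3: 73, 7: 88}
N_CONFIRMED = 98


def top_k_hit(ranked_differential, final_diagnosis, k):
    """Return True if the final diagnosis is among the first k candidates.

    Exact string matching is an illustrative assumption, not the
    study's adjudication method.
    """
    return final_diagnosis in ranked_differential[:k]


def top_k_rates(hits_by_k, n):
    """Convert hit counts into percentages, rounded to one decimal place."""
    return {k: round(100 * hits / n, 1) for k, hits in hits_by_k.items()}


rates = top_k_rates(REPORTED_HITS, N_CONFIRMED)
for k, pct in rates.items():
    print(f"top-{k}: {REPORTED_HITS[k]}/{N_CONFIRMED} = {pct:.1f}%")
```

At one decimal place the quoted counts work out to 56.1%, 74.5% and 89.8%, which is worth bearing in mind when reading the whole-number percentages in the text.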
In a blinded comparison, specialists rated AMIE’s differentials and management plans against those of the primary care physicians (PCPs). There was no statistically significant difference for the differential diagnosis (p = 0.6) or for the appropriateness and safety of the management plan (p = 0.1 and p = 1.0, respectively). But PCPs were rated significantly better on the practicality (p = 0.003) and cost-effectiveness (p = 0.004) of their plans. Two caveats matter. First, AMIE tended to produce longer lists, which could have revealed to raters which differential came from the AI, so its differentials were truncated to the same length as the physicians’ before rating; this was not a like-for-like contest. Second, the authors note the comparison “favored physicians who had more context,” including the AMIE transcript itself, the electronic health record (EHR), and a physical exam AMIE never had.
Patients’ attitudes toward AI improved significantly after the encounter (p < 0.001), and clinicians reported the transcripts were useful for visit prep.
The authors are explicit about the limits: a single academic center, a text-only interface, no controlled comparison arm, and a small sample. This is a feasibility signal, not evidence of clinical benefit — and, as a preprint, it has not been peer reviewed. What it establishes is narrower than a diagnostic win: a diagnostic LLM can be run with live patients under physician oversight without any consultation having to be stopped.