The Vital Record — Digital Health & AI

Colorado's Landmark AI Law Hits Its Enforcement Date — In Name Only

Owen Tanaka, Digital Health & AI Desk — Tue, 23 Jun 2026 00:00:00 +0000

June 30, 2026 — seven days from today — was supposed to be a watershed moment for healthcare AI regulation: the day Colorado’s SB 24-205 became the first U.S. state statute to impose binding obligations on developers and deployers of high-risk AI systems, with healthcare as an explicitly named sector.

The law, passed in 2024 and originally set to take effect February 1, 2026, was pushed back five months when Governor Jared Polis signed SB 25B-004 in August 2025. Under SB 24-205’s text, any AI system that makes or is a “substantial factor” in a “consequential decision” — defined as one with a material legal or similarly significant effect on healthcare access or delivery — would qualify as high-risk. Covered developers were required to use “reasonable care” to prevent algorithmic discrimination. Covered deployers were required to conduct algorithmic impact assessments, implement documented risk management policies, and notify patients when AI played a substantial role in decisions affecting them. The Colorado Attorney General held exclusive enforcement authority; the statute created no private right of action.

But the June 30 date now carries little practical weight. In April, Elon Musk’s xAI filed a First Amendment challenge in federal court; the Department of Justice intervened on the same side. On April 27, the court granted a joint motion by xAI and the Colorado AG to stay enforcement. On May 14, Governor Polis signed SB 26-189, which repeals and reenacts SB 24-205 under a narrower framework — dropping mandatory impact assessments, the freestanding reasonable-care duty, and self-reporting requirements. SB 26-189 takes effect January 1, 2027. The AG has stated he will not enforce either law until rulemaking concludes.

For health systems and clinical AI vendors, the practical compliance deadline has moved. The statutory obligations that defined SB 24-205’s original healthcare coverage — algorithmic impact assessments, documented risk policies, consequential-decision disclosures — are no longer the operative text. What comes next depends on rulemaking under a fundamentally different law.

AI Model Surfaces 18 New Diagnoses in 376 Previously Unsolved Pediatric Rare-Disease Cases

Owen Tanaka, Digital Health & AI Desk — Mon, 22 Jun 2026 00:00:00 +0000

For the families of children with rare genetic diseases, the path to a diagnosis can stretch across years and dozens of specialists and still end without an answer. A study published June 18, 2026, in NEJM AI suggests that AI-assisted genomic reanalysis may open a narrow but meaningful door for some of those families, provided human geneticists keep the final say.

Researchers at Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard University, and OpenAI applied the o3 Deep Research reasoning model to 376 de-identified pediatric cases that had previously undergone genetic testing and expert review without reaching a diagnosis. After the AI surfaced candidate gene-phenotype links and human clinical geneticists independently evaluated each output, 18 cases resulted in confirmed diagnoses — an incremental diagnostic yield of 4.8%.

The breakdown by disease area: 10 patients with rare neurodevelopmental disorders, four with neuromuscular disease, two cases of sudden unexpected death in pediatrics, and two patients with early-onset psychosis.
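The yield figure follows directly from the counts reported above. A minimal Python sketch of that arithmetic, using only the numbers given in this article (the variable names are illustrative and do not come from the paper):

```python
# Counts as reported in this article; names are illustrative, not from the paper.
total_cases = 376
diagnoses_by_area = {
    "neurodevelopmental disorders": 10,
    "neuromuscular disease": 4,
    "sudden unexpected death in pediatrics": 2,
    "early-onset psychosis": 2,
}

# The per-area counts should add up to the reported number of new diagnoses.
confirmed = sum(diagnoses_by_area.values())

# Incremental yield: new diagnoses as a share of previously unsolved cases.
incremental_yield_pct = 100 * confirmed / total_cases

print(confirmed)                        # 18
print(round(incremental_yield_pct, 1))  # 4.8
```

The denominator is the full set of previously unsolved cases, which is why the result is an incremental yield rather than a baseline diagnostic rate.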

“It got almost 5% new diagnoses, which doesn’t sound like a lot, but considering how many times these had already been analyzed, that’s a huge number, and each one means an answer for a family,” said Catherine Brownstein, scientific director of the genetic investigations arm of the Manton Center and one of the study’s lead researchers.

How the pipeline worked — and what it did not do

The research team fed the o3 model a structured dossier for each case: clinicians’ notes, a description of the patient’s phenotype, and a filtered list of candidate genes. The model was asked to propose the most plausible molecular explanation and to show its reasoning — effectively generating evidence-linked hypotheses, not diagnoses.

From there, every model output required independent review by at least two board-certified clinical geneticists applying the ACMG/AMP variant classification framework. A finding advanced to diagnosis status only after four conditions were met: expert review, pathogenic or likely-pathogenic variant classification, confirmation in a CLIA-certified laboratory, and clinical return of the result to the family. The o3 model made no clinical decisions and issued no diagnoses.

Study type and key limitations

This was a retrospective reanalysis study. The 376 cases represent a selected cohort of patients who had already received negative workups, so the population was, by design, the hardest to diagnose. No prospective trial data yet show that the pipeline improves outcomes in real-world clinical deployment, and the study had no controlled comparison arm. The 4.8% yield should be read in that context: it is an incremental gain on top of prior specialist evaluation, not a baseline diagnostic rate.

Conflict of interest

The collaboration is part of a broader initiative: in March 2025, OpenAI committed $50 million to its NextGenAI consortium, a 15-institution research partnership that includes Boston Children’s Hospital alongside Harvard, MIT, Caltech, Oxford, and ten other institutions. That financial relationship between the AI company whose model was evaluated and a member institution whose cases were studied is a material conflict of interest. Readers should weigh that connection when assessing the findings.

Correction, 2026-06-22: An earlier version of this article stated that OpenAI committed $50 million specifically to Boston Children’s Hospital AI initiatives, announced in May 2025. The commitment was made in March 2025 to OpenAI’s NextGenAI consortium — a 15-institution research partnership, of which Boston Children’s is one member — not a bilateral grant to Boston Children’s alone. The conflict-of-interest disclosure has been updated accordingly.

The study represents a peer-reviewed signal that AI-assisted genomic reanalysis can surface actionable hypotheses in cases that have exhausted standard pipelines. Whether that signal holds in a prospective, multicenter setting — and whether the 4.8% yield is reproducible across different institutions — remains to be established.

An AI Agent Outperformed Physicians on Simulated ED Cases — in a Sandboxed EHR, Not a Real Hospital

Owen Tanaka, Digital Health & AI Desk — Sun, 21 Jun 2026 00:00:00 +0000

An artificial intelligence agent called MIRA outperformed emergency physicians on diagnostic accuracy across eight clinical presentations in a study of 574 retrospective cases, according to a paper published in Nature — but the system has never treated a real patient, operates without FDA clearance, and was evaluated in a controlled sandboxed environment that differs substantially from the conditions of a functioning emergency department.

What the Study Did

The study evaluated MIRA — a multi-step AI agent designed to reason sequentially through clinical information — using a curated dataset of retrospective emergency department cases. The eight diagnoses included high-acuity conditions such as pulmonary embolism, acute myocardial infarction, sepsis, and stroke. Each case presented the AI with the same structured data available to physicians: history, physical examination findings, laboratory results, and imaging reports.

MIRA’s diagnostic accuracy exceeded that of the physician comparator group on the aggregate dataset and on six of the eight individual diagnoses. The system showed particular strength in synthesizing laboratory and imaging data simultaneously.

Retrospective case review and real-time emergency medicine are different tasks. In the study, the correct diagnosis was already known and cases were selected to include specific conditions. In an actual emergency department, the clinician faces undifferentiated presentations where the diagnosis is unknown in advance and most patients do not have the eight diagnoses evaluated here. Removing that uncertainty changes what the AI is actually being asked to do.

What the Study Does Not Establish

The paper does not address clinical workflow integration, time pressure, handling of ambiguous presentations, error consequences in life-threatening situations, or liability. MIRA has not received FDA 510(k) clearance or De Novo authorization and has not demonstrated performance in an actual emergency department.

Independent experts commenting through the Science Media Centre noted that validation on prospective, unselected patient populations — where AI systems routinely perform worse than on curated test sets — would be required before clinical conclusions could be drawn.

MIRA autonomous clinical AI. Nature. 2026; doi:10.1038/s41586-026-10675-5. Science Media Centre expert commentary, June 2026.

Correction (June 21, 2026): An earlier version of this article stated MIRA was evaluated on “311 retrospective emergency cases.” The Nature paper evaluated MIRA across 574 emergency department cases from the MIMIC-IV dataset; 311 is the per-arm count in a triple-evaluation audit subset, not the total number of cases in the evaluation. The dek, front-matter claim, and body text have been corrected.

Colorado Enacts Laws Barring AI-Only Coverage Denials and Unsupervised AI Psychotherapy

Digital Health & AI Desk — Sat, 20 Jun 2026 00:00:00 +0000

Colorado Governor Jared Polis signed two health-care artificial intelligence bills into law this week, adding the state to a growing roster of jurisdictions that have moved to regulate algorithmic decision-making in insurance coverage and mental health care delivery.

HB 26-1139: AI Coverage Denials

House Bill 26-1139 prohibits health insurers operating in Colorado from issuing prior-authorization denials based solely on the output of an automated algorithm or artificial intelligence system. Any AI-assisted denial must be reviewed and affirmed by a licensed clinician, and the insurer must document the clinical basis for the determination and provide the treating clinician an opportunity to submit additional information before a final decision is issued.

Several states already have analogous provisions. California’s SB 1120, effective January 1, 2025, was the first such law to take effect; Arizona, Maryland, Texas, Connecticut, and Nebraska have since enacted similar requirements. A parallel rule under development at the Centers for Medicare & Medicaid Services would apply an equivalent requirement to Medicare Advantage plans nationally, though that rule has not been finalized.

HB 26-1195: AI Psychotherapy

House Bill 26-1195 requires any AI-delivered psychotherapy service operating in Colorado, including chatbot-delivered cognitive behavioral therapy, mental health coaching applications, and related interventions, to disclose its AI nature to users and to operate under the supervision of a licensed mental health professional, who must review clinical decisions at a minimum frequency set by the Colorado Department of Regulatory Agencies.

The law carves out pure self-help tools and platforms used solely for screening or psychoeducation. It does not prohibit AI-assisted therapy — only unsupervised AI performing therapeutic functions.

Illinois enacted the broadly similar Wellness and Oversight for Psychological Resources (WOPR) Act (House Bill 1806, signed August 4, 2025) before Colorado, making Illinois the first state with this specific supervision requirement. Nevada’s Assembly Bill 406 on AI mental health tools also preceded the Colorado measure.

Distinct from Colorado’s Broader AI Act

Both bills are narrowly targeted at health-care applications and are distinct from Colorado’s SB 24-205, the broader consumer-protection AI law passed in 2024 that set risk-based requirements for high-stakes AI systems generally and that SB 26-189 repealed and reenacted in narrower form in May. The new health-care bills add sector-specific requirements on top of that general-purpose framework rather than replacing it.

Colorado HB 26-1139 and HB 26-1195, signed 2026. Cal. SB 1120 (eff. Jan. 1, 2025). Ill. HB 1806 (WOPR Act, signed Aug. 4, 2025). Nev. AB 406.

Autonomous AI Agent Navigates EHR, Surpasses Physicians in Simulated Cases — Nature

Digital Health & AI Desk — Fri, 19 Jun 2026 00:00:00 +0000

An autonomous AI agent named MIRA (Medical Intelligent Reasoning Agent), described in a paper published in Nature (DOI: 10.1038/s41586-026-10675-5, PMID: 42310457), has demonstrated the ability to navigate full electronic health records and generate diagnostic and management decisions that surpassed the performance of attending physicians in a controlled simulation environment.

The study, published June 17, 2026, describes MIRA as a large-language-model-based agentic system trained to interact with a sandboxed electronic health record interface — browsing notes, ordering and interpreting tests, adjusting treatment plans, and documenting reasoning — without human assistance. The system was evaluated against a benchmark of complex inpatient cases drawn from de-identified records and was compared with the performance of board-certified physicians on the same cases.

MIRA outperformed the physician cohort on the primary accuracy metric and scored higher on case resolution. The evaluation cases were chosen to include diagnostic uncertainty, polypharmacy, and comorbidity, conditions that challenge rule-based systems and require integrative reasoning.

Context and limitations

The study is a simulation benchmark, not a clinical deployment. MIRA was evaluated in a sandboxed environment using de-identified historical records; it was never run in a live inpatient setting, with or without real-time physician oversight. The gap between benchmark performance and real-world clinical integration is substantial: EHR data in live settings are incomplete, ambiguous, and subject to documentation lag; patients can provide verbal information that changes the differential; and clinical decisions carry liability considerations and communication requirements that benchmarks do not capture.

The authors acknowledge these limitations. The result adds to a body of evidence showing that agentic AI systems can perform diagnostic reasoning at or above specialist-physician level in controlled evaluations — but clinical integration requires safety, reliability, and interpretability thresholds that no benchmark study alone can establish.

The Nature publication reports a peer-reviewed large-cohort evaluation in which an agentic EHR navigation system surpassed physicians on the primary accuracy metric; the paper itself makes no priority claim over prior work. The key distinction from earlier systems such as Google DeepMind’s AMIE is not data modality, since AMIE had published multimodal capabilities (vision, documents, structured data) by 2025, but operational scope: AMIE operates as a conversational dialogue agent, while MIRA executes agentic EHR actions (ordering tests, adjusting treatment plans, documenting in the chart) without human intermediation.

Correction (2026-06-19): Two errors corrected by post-publication fact-check. (1) The article described MIRA as “the first agentic EHR system to do so in a peer-reviewed large-cohort evaluation” — the MIRA paper (PMID 42310457) makes no such priority claim; this was editorial attribution without source support. (2) Google DeepMind’s AMIE was characterized as operating on “structured diagnostic scenarios or single-modality data” — this is inaccurate; by 2025 AMIE had published multimodal capabilities (vision, documents, structured data). The correct distinction is operational: AMIE is a conversational dialogue system that does not perform agentic EHR actions (ordering tests, adjusting treatment plans, documenting in the chart), while MIRA does.

FDA Cleared an AI Sepsis Monitor in April; Prospective Data Show an 18.7% Relative Mortality Reduction — With Conditions

Owen Tanaka, Digital Health & AI Desk — Wed, 17 Jun 2026 00:00:00 +0000

On April 30, 2026, the FDA cleared the Bayesian Health Sepsis Flagging Device (510(k) number K250680) under the premarket notification pathway, judging it substantially equivalent to Prenosis’ earlier De Novo-authorized ImmunoScore. The software, the Targeted Real-time Early Warning System (TREWS), originally developed at Johns Hopkins University, continuously parses EHR data streams and issues Sepsis Deterioration Alerts for adult inpatients at rising risk. The cleared indication covers adult inpatients hospital-wide; the device is not limited to the ICU.

The pivotal evidence is a prospective, multi-site implementation study published in Nature Medicine in July 2022 (Adams et al.). Across five hospitals, 590,736 patients were monitored; 6,877 met sepsis criteria and received a TREWS alert before antibiotic initiation. This was a prospective pre-post observational design with concurrent unexposed comparators — not a randomized controlled trial.

The primary finding: among sepsis patients whose alert was confirmed by a clinician within three hours of firing, in-hospital mortality fell by 3.3 percentage points (adjusted absolute reduction; 95% CI 1.7–5.1), representing an 18.7% relative reduction (95% CI 9.4–27.0). For the subgroup additionally flagged as high-risk, the absolute mortality benefit was 4.5 points (95% CI 0.8–8.3). TREWS detected 82% of sepsis cases before antibiotic initiation.
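The absolute and relative figures above are linked by a simple identity: relative reduction equals absolute reduction divided by the comparator risk. The sketch below backs out the comparator mortality that the two reported numbers imply. That implied figure is a back-of-envelope derivation, not a value reported in this article, and because both inputs are adjusted estimates it should be read as approximate. The function name is hypothetical.

```python
def implied_baseline_risk(abs_reduction_pts: float, rel_reduction_pct: float) -> float:
    """Comparator risk (in %) implied by: relative reduction = absolute reduction / baseline."""
    return abs_reduction_pts / (rel_reduction_pct / 100)

arr_pts = 3.3   # adjusted absolute reduction, percentage points (as reported)
rrr_pct = 18.7  # adjusted relative reduction, percent (as reported)

# Derived, not reported: the comparator mortality consistent with both figures.
baseline_pct = implied_baseline_risk(arr_pts, rrr_pct)
# Derived, not reported: mortality after subtracting the absolute reduction.
with_timely_confirmation_pct = baseline_pct - arr_pts

print(round(baseline_pct, 1))                  # 17.6
print(round(with_timely_confirmation_pct, 1))  # 14.3
```

Reporting both figures matters because the same 3.3-point drop would be a much smaller relative change against a higher baseline mortality.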

The benefit is conditional. The mortality reduction applied specifically to patients whose alerts were confirmed by a clinician within three hours, not to all alerted patients. The prospective observational design does not establish causation, and a confirmatory randomized trial has not been completed. The clearance date, K-number, and cleared indication cited here come from manufacturer and independent trade press sources; the FDA 510(k) database entry for K250680 was not directly reviewed for this brief.

AI Grading Cuts False Referrals by 45 Points in Diabetic Macular Edema Screening

Vital Record Staff — Tue, 16 Jun 2026 00:00:00 +0000

Community-based screening for diabetic macular edema runs into a fundamental problem: the optometrists conducting OCT scans at the point of care frequently refer patients who do not have clinically significant DME. The downstream cost is appointment backlogs at specialist centers and delayed care for patients who genuinely need treatment.

A randomized controlled trial published in JAMA tested one fix directly and found a large effect. Investigators at multiple community optometry clinics in Hong Kong SAR (ChiCTR2300075087) enrolled 276 patients with diabetes undergoing OCT screening and randomly assigned them to AI-assisted grading (n=137) or standard grading by community optometrists (n=139).

The primary outcome was the false-referral rate—the proportion of patients without true DME who were sent to specialist care anyway. AI-assisted grading produced a false-referral rate of 24.1% (95% CI, 14.6%–37.0%), against 69.1% (95% CI, 61.0%–76.1%) in the standard grading arm. The absolute reduction of 45.0 percentage points (95% CI, 32.1–56.2; p<0.001) is large enough to be practically significant in any system struggling with optometry-to-ophthalmology referral volume.
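The 45-point figure is the difference between the two arm-level rates. The short sketch below reproduces it and also expresses the change relative to the standard-grading arm. The relative figure is derived here for illustration and is not a number reported in this article; the variable names are illustrative.

```python
# False-referral rates as reported in this article (percent).
ai_false_referral_pct = 24.1
standard_false_referral_pct = 69.1

# Absolute reduction, in percentage points (matches the reported 45.0).
absolute_reduction_pts = standard_false_referral_pct - ai_false_referral_pct

# Relative reduction versus standard grading (derived, not reported).
relative_reduction_pct = 100 * absolute_reduction_pts / standard_false_referral_pct

print(round(absolute_reduction_pts, 1))  # 45.0
print(round(relative_reduction_pct, 1))  # 65.1
```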

Sensitivity was fully preserved: both arms achieved 100.0% sensitivity for DME referral (95% CI, 100.0%–100.0%), meaning the AI system achieved its specificity gain without missing a single case of true DME—it became more accurate at ruling out non-disease without sacrificing detection. Median time to specialist appointment was reduced by 18 days in the AI group.

The AI system evaluated in the trial is not commercially deployed in China or elsewhere; the study was conducted at academic-affiliated community clinics with standardized OCT equipment, and generalizability to uncontrolled community settings or different imaging devices would require further validation. The investigators note that the model was not trained on the study population, which strengthens the external validity argument.

For healthcare systems investing in diabetic retinopathy screening infrastructure, the efficiency signal from this trial will be difficult to ignore.

Correction, 2026-06-16: An earlier version of this article stated sensitivity was 97.1% in both arms. The JAMA source (ChiCTR2300075087) reports 100.0% sensitivity for DME referral (95% CI, 100.0%–100.0%) in both groups; per-arm Ns have also been corrected to n=137 (AI-assisted) and n=139 (standard), per the published trial.

General-Purpose AI Outperforms Specialized Clinical Tools in Head-to-Head Benchmark

Owen Tanaka, Digital Health & AI Desk — Mon, 15 Jun 2026 00:00:00 +0000

General-purpose large language models (LLMs) outperformed two commercial clinical AI tools across every benchmark stage tested, according to a peer-reviewed study published June 12, 2026, in Nature Medicine (PMID 42286322).

Researchers from NYU Langone Health and The University of Texas at Austin evaluated three frontier LLMs — GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 — against two specialized clinical AI products: OpenEvidence and UpToDate Expert AI. The evaluation comprised three sequential stages:

MedQA — 500 standardized medical knowledge questions
HealthBench — 500 items measuring alignment with clinician expectations
Real Clinical Queries (RCQ) — 100 de-identified queries submitted by physicians to a general-purpose LLM in a live clinical setting, rated by 12 U.S. clinicians in a randomized blinded review producing 1,800 model-question annotations

Frontier LLMs outperformed the specialized clinical AI tools across all three evaluation stages. On the RCQ benchmark — the evaluation most grounded in actual clinical use — the specialized tools performed comparably to Google Search AI Overview, a general consumer AI product.

Important limitations: This was a benchmark-based, retrospective evaluation, not a prospective study of patient-care outcomes. The study does not establish whether using frontier LLMs in clinical settings improves or harms patients. The authors call for independent, real-world evaluation of AI tools before clinical adoption. Results also do not address security, regulatory compliance, or workflow integration requirements that clinical tools may be designed to meet.

ARPA-H Launches ADVOCATE to Fund Development of the First FDA-Authorized Autonomous AI System for Cardiovascular Care

Owen Tanaka, Digital Health & AI Desk — Mon, 15 Jun 2026 00:00:00 +0000

The Advanced Research Projects Agency for Health (ARPA-H) announced in January 2026 a new program called ADVOCATE — Agentic AI-EnableD CardioVascular CAre TransfOrmation — aimed at funding the development of the first FDA-authorized autonomous AI system for managing advanced cardiovascular disease around the clock.

The program targets two primary conditions: heart failure and cardiovascular disease in patients who have had a myocardial infarction. ARPA-H intends these AI agents to provide patients with personalized diet and exercise guidance, assist with care navigation including appointment scheduling, and, where authorized, autonomously write or modify prescriptions.

Three technical areas: ADVOCATE solicits proposals across the following:

TA-1 — Patient-facing clinical AI agent: Development of an AI agent capable of clinical reasoning, care navigation, and autonomous medical actions for individual cardiovascular patients.
TA-2 — Supervisory oversight agent: A separate AI system that monitors deployed clinical agents for continued accuracy and patient safety.
TA-3 — Health system integration: Recruitment and partnership with health systems for real-world design, development, and deployment.

Projected savings: ARPA-H states that if ADVOCATE technologies are successfully developed and widely adopted, the program could save $55 billion annually in U.S. healthcare costs. This is a conditional program-level projection, not a current or guaranteed figure.

Regulatory pathway: The program targets FDA authorization on an approximately three-year timeline. The specific FDA pathway has not been publicly specified.

Program status as of June 15, 2026: No awardees have been publicly announced and no specific award dollar amounts have been disclosed. The program was in the proposal solicitation phase at announcement.

Correction — June 15, 2026: The original headline referred to “the First FDA-Authorized Autonomous AI Cardiologist,” a characterization not used by ARPA-H and one that implies a delivered product. ARPA-H describes ADVOCATE as aiming to develop “the first FDA-authorized agentic AI technology” for cardiovascular care — a future R&D goal, not an accomplished fact. The headline has been updated accordingly. The acronym expansion has also been corrected to reflect the official capitalization: “AI-EnableD” (not “AI-Enabled”).

AI Pathology Models Pass Lab Tests But Stumble Across Hospitals, New Benchmark Finds

Owen Tanaka, Digital Health & AI Desk — Sun, 14 Jun 2026 00:00:00 +0000

Digital pathology foundation models are large AI systems trained on millions of digitized tissue slides. Developers position them for high-stakes tasks such as cancer detection, tumor classification, and treatment-response assessment, analyses that currently require specialized human pathologists. A study published in Nature Communications tested whether these models hold up when the slides come from institutions other than those used in training.

The benchmark, called PathoROB, assembled data spanning 34 medical centers and 28 distinct biological classes across four datasets. Researchers evaluated 20 publicly available foundation models against three metrics: a Robustness Index (whether biological signal dominates over institutional noise in a model’s internal representations), an Average Performance Drop (how much downstream task accuracy degrades when a model encounters out-of-distribution institutional features), and a Clustering Score (whether unsupervised groupings reflect biology rather than hospital-specific artifacts).

The non-biological artifacts that drive the problem are institution-level technical differences: staining protocols, imaging equipment, and slide-processing techniques that vary across the 34 centers without reflecting any underlying change in patient biology.

No model scored at the ceiling across all three metrics. On the Robustness Index, scores ranged from 0.861 for the top-ranked model (averaged across all four datasets, per the paper’s primary results) to 0.446 for the lowest — approximately a twofold gap between best and worst. The evaluation was retrospective: existing models were tested against archived multi-center data, not prospectively validated in live clinical settings.
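The "approximately twofold" description can be checked against the two reported scores. A minimal sketch, assuming only the two Robustness Index values quoted above (the variable names are illustrative):

```python
# Robustness Index scores as reported in this article (averaged across four datasets).
top_score = 0.861
lowest_score = 0.446

# Ratio of best to worst, and the raw gap between them.
ratio = top_score / lowest_score
gap = top_score - lowest_score

print(round(ratio, 2))  # 1.93
print(round(gap, 3))    # 0.415
```

A ratio of about 1.93 supports "approximately twofold"; the comparison is between the two extremes of the 20 models, not a typical gap.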

The authors make the PathoROB benchmark and a public leaderboard available at github.com/bifold-pathomics/PathoROB, enabling developers and hospital procurement teams to run candidate models through the same evaluation before adoption.

The study’s abstract warns that “non-robust FM representations can cause major diagnostic downstream errors preventing safe clinical adoption.” The authors argue that robustness varies substantially across institutions and is measurable — and that measuring it is a necessary prerequisite for safe clinical deployment, not a reassurance that current models are ready.

Correction, 2026-06-14: An earlier version of this article stated that PathoROB evaluated “23 publicly available foundation models.” The published paper (Nature Communications, PMID 42277006) reports 20 models. The model count has been corrected above. Additionally, the original article cited a top Robustness Index score of 0.928, which reflects a GitHub leaderboard composite figure; the paper’s primary averaged result across all four datasets is 0.861, and the text above has been updated to reflect the paper’s primary figure. Finally, the original article stated the study “does not conclude that current tools are unsafe”; the abstract explicitly warns that non-robust representations “can cause major diagnostic downstream errors preventing safe clinical adoption,” and the framing has been corrected above.

House Appropriations Panel Votes to Strip Funding from CMS's AI-Powered Medicare Prior-Authorization Pilot

Owen Tanaka, Digital Health & AI Desk — Thu, 11 Jun 2026 00:00:00 +0000

A House Appropriations subcommittee voted June 9 to block federal funding for the Centers for Medicare & Medicaid Services’ first-of-its-kind artificial intelligence prior-authorization pilot in traditional Medicare, escalating a bipartisan congressional challenge to a program that has drawn criticism from patient advocates, physicians, and lawmakers on both sides of the aisle.

The amendment, adopted by voice vote during the full committee’s fiscal year 2027 Labor-HHS spending bill markup, would bar any appropriated funds from being used to implement the Wasteful and Inappropriate Service Reduction model — known as WISeR — or any similar model that adds prior-authorization requirements to traditional Medicare. Subcommittee Chair Robert Aderholt (R-AL) characterized the amendment as bipartisan; Rep. Lois Frankel (D-FL) said she developed it alongside Rep. Andy Harris (R-MD).

What WISeR does. Launched on January 1, 2026, WISeR contracts with private technology vendors to conduct AI- and machine-learning-powered pre-payment review of select Part B services in six states: Arizona, Washington, New Jersey, Texas, Ohio, and Oklahoma. The model runs for six performance years through December 31, 2031. Thirteen service categories are currently active — including bioengineered skin substitutes, epidural steroid injections, several classes of implanted nerve stimulators, arthroscopic knee debridement for osteoarthritis, and cervical spinal fusion — services that CMS says carry elevated risk of waste, fraud, and inappropriate use. AI flags requests for human clinical review; all denials must be reviewed by a licensed clinician, with standard requests processed within three days and urgent requests within two. Notably, vendors are compensated not by flat fee but as a percentage of savings from services denied or diverted, an incentive structure critics have argued creates a financial bias toward refusal.

According to KFF analysis, roughly 1.1 million traditional Medicare beneficiaries nationwide received at least one WISeR-covered service in 2024, with those in the six pilot states now subject to prior authorization.

The GAO determination. Congressional opposition intensified after the Government Accountability Office concluded on May 12, 2026, that WISeR constitutes a rule under the Administrative Procedure Act and is therefore subject to the Congressional Review Act’s submission requirements, a finding that opened a formal legislative path to repeal. Senate and House Democrats have since introduced separate joint resolutions of disapproval.

What the committee said. The amendment’s language states that “any proposal to impose prior authorization requirements in traditional Medicare should be subject to robust congressional oversight and transparent evaluation of impacts on beneficiary access to care, provider burden and program costs.”

What comes next. The FY2027 HHS spending bill must still pass the full House and then clear the Senate before any funding prohibition could take effect. WISeR continues to operate in the meantime. A parallel amendment passed the same committee during the FY2026 spending process but did not survive into final legislation, a precedent that leaves the program’s long-term fate uncertain.

An AI pathology test may flag which high-risk prostate cancers benefit most from abiraterone

Owen Tanaka, Digital Health & AI Desk — Tue, 09 Jun 2026 00:00:00 +0000

Abiraterone (Zytiga), added to long-term hormone therapy, extends survival in very-high-risk localized prostate cancer. It is also toxic and expensive, and clinicians have no reliable way to know which men truly need it. A new analysis of two STAMPEDE phase 3 trials argues that an artificial-intelligence read of a routine biopsy slide can help sort that out — and, importantly, the model was locked before the data were unblinded.

Researchers led by University College London and Artera applied a previously validated multimodal AI (MMAI) model — the ArteraAI Prostate test — to 1,137 men with non-metastatic, clinically very-high-risk disease enrolled in two sequential, non-overlapping abiraterone comparisons within the STAMPEDE platform (NCT00268476). The model combines digitized pathology images with PSA, tumor stage, and age, and used a pre-established 75th-percentile cutoff to split patients into “MMAI very-high-risk” and “standard high-risk” groups. The primary endpoint was metastasis-free survival (MFS) — a composite of time to metastasis or death, which is a surrogate rather than a direct measure of overall survival. In total, 583 men received long-term ADT and 554 received ADT plus abiraterone.

A biomarker that separates benefit from non-benefit

In the MMAI very-high-risk subgroup (N=268), adding abiraterone cut the risk of metastasis or death by more than half — hazard ratio 0.47 (95% CI 0.31–0.70). Five-year MFS rose from 62% (95% CI 53–70%) with ADT alone to 81% (74–87%) with abiraterone. Because MFS is a composite surrogate, this is not the same as a demonstrated overall-survival benefit in this subgroup.

In the larger standard-high-risk group (N=869), the drug added little: HR 0.83 (95% CI 0.63–1.09), with five-year MFS of 82% (78–85%) with ADT alone versus 84% (80–87%) with abiraterone — a small numerical difference that was not statistically significant. The treatment-by-biomarker interaction was significant (p=0.02), and the pattern held in both node-negative and node-positive subgroups.
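As a reading aid, the two subgroup results can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, the inputs are the figures quoted above, and asking whether a confidence interval excludes 1 is a rule of thumb, not the trial's statistical analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubgroupResult:
    """Figures quoted in the article for one MMAI subgroup (names are hypothetical)."""
    name: str
    hazard_ratio: float
    ci_low: float
    ci_high: float
    mfs_adt_pct: float  # five-year MFS with ADT alone
    mfs_abi_pct: float  # five-year MFS with ADT plus abiraterone

    def ci_excludes_one(self) -> bool:
        # A 95% CI lying entirely on one side of 1 is conventionally read as significant.
        return self.ci_high < 1.0 or self.ci_low > 1.0

    def absolute_mfs_gain(self) -> float:
        # Difference in five-year MFS, in percentage points.
        return self.mfs_abi_pct - self.mfs_adt_pct


very_high = SubgroupResult("MMAI very-high-risk", 0.47, 0.31, 0.70, 62, 81)
standard = SubgroupResult("standard high-risk", 0.83, 0.63, 1.09, 82, 84)

for group in (very_high, standard):
    print(
        f"{group.name}: HR {group.hazard_ratio}, "
        f"CI excludes 1: {group.ci_excludes_one()}, "
        f"five-year MFS gain: {group.absolute_mfs_gain()} points"
    )
```

One subgroup reaching significance while the other does not would not, on its own, show that the biomarker separates benefit from non-benefit. That is what the reported treatment-by-biomarker interaction test (p=0.02) addresses.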

A locked digital-pathology test predicted abiraterone benefit on metastasis-free survival — a surrogate endpoint — in very-high-risk disease and little benefit in the standard-high-risk group: a hypothesis for prospective testing.

The caveats are real. This is a post-hoc analysis of randomized data, not a prospective biomarker-stratified trial; the endpoint is a surrogate (metastasis or death), not overall survival; the two abiraterone comparisons shared no controls; and the standard-high-risk confidence interval still crosses 1, so a modest benefit there cannot be ruled out. The authors frame the test as a way to “maximize benefit from treatment intensification whilst avoiding unnecessary toxicity” — a hypothesis that a prospective trial would need to confirm before it changes practice.

Google's AMIE ran with 100 real patients and was never forced to stop. On diagnosis, the picture was mixed.

Owen Tanaka, Digital Health & AI Desk — Sun, 07 Jun 2026 00:00:00 +0000

Conversational diagnostic AI has lived almost entirely in simulation, scored against vignettes and actors. A preprint from Google Research, Google DeepMind and Beth Israel Deaconess Medical Center (BIDMC) moves it into a clinic with real patients — cautiously, and with a human physician watching every word.

In this prospective, single-arm feasibility study (NCT06911398), 100 adults scheduled for non-emergency urgent-care visits at a leading academic medical center completed a pre-visit text chat with AMIE — the Articulate Medical Intelligence Explorer — up to five days before an in-person or telehealth appointment. AMIE took the history and generated a differential diagnosis and a transcript for the treating clinician.

What the study set out to measure

The pre-registered primary outcomes were not accuracy. They were safety and feasibility (the number and type of chat terminations), the quality of AMIE’s clinical dialogue, and the experiences of patients and physicians. Diagnostic accuracy and the head-to-head comparison against doctors were secondary outcomes — and the authors caution that the single-arm design “offers challenges to meaningfully evaluate” them.

On the primary safety endpoint, the result was clean: across all interactions, the physician “AI supervisors” — a panel of board-certified internists watching each chat live via secure video with screen-sharing — triggered zero of the four pre-specified stop criteria. That is the finding the study was built to produce.

It was not, however, fully hands-off. The paper reports the supervisor stepped in on three occasions: once to clarify symptoms in order to rule out a potentially emergent condition the patient did not have, once to clarify when to seek emergency care, and once to correct an AMIE error — the model stated that a patient’s past surgery date was in the future. So no consultation had to be halted, but a human did intervene, including to fix a hallucination.

The study was designed to answer whether a diagnostic chatbot can be run safely with real patients under supervision. On that question it returned a yes — with the human supervisor still doing real work.

What the differential caught, and what it didn’t

Accuracy was scored against a final diagnosis set by a blinded panel of three internists via chart review eight weeks after the visit. These figures cover the 98 patients with a confirmed final diagnosis, not the full 100, and they depend heavily on how many guesses you allow AMIE.

AMIE’s single leading diagnosis matched the final answer in 55 of 98 cases (56%, top-1). Widen the net to its first three candidates and that rises to 73 of 98 (75%, top-3); allow the first seven candidates of its ranked list and the correct diagnosis appeared in 88 of 98 (90%, top-7). The 90% figure, in other words, is a top-7 number — not “the differential was right nine times in ten.”
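To make the top-k distinction concrete, here is a minimal sketch. The function name and the example differential are hypothetical and invented for illustration; only the counts (55, 73, and 88 of 98) come from the figures quoted above, and the percentages are shown to one decimal place.

```python
def top_k_hit(ranked_differential, final_diagnosis, k):
    """Return True if the final diagnosis is among the first k ranked candidates."""
    return final_diagnosis in ranked_differential[:k]


# Hypothetical ranked list, invented for illustration (not from the study).
example_differential = [
    "viral pharyngitis",
    "streptococcal pharyngitis",
    "infectious mononucleosis",
    "allergic rhinitis",
]
example_final = "infectious mononucleosis"

missed_at_top_1 = top_k_hit(example_differential, example_final, k=1)
caught_at_top_3 = top_k_hit(example_differential, example_final, k=3)

# Counts as quoted in the article, out of 98 patients with a confirmed diagnosis.
REPORTED_HITS = {1: 55, 3: 73, 7: 88}
N_CONFIRMED = 98

accuracy_pct = {
    k: round(100 * hits / N_CONFIRMED, 1) for k, hits in REPORTED_HITS.items()
}

for k, pct in accuracy_pct.items():
    print(f"top-{k}: {REPORTED_HITS[k]}/{N_CONFIRMED} = {pct}%")
```

The same case can count as a miss at top-1 and a hit at top-3, which is why a headline accuracy figure means little without the value of k attached.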

In a blinded comparison, specialists rated AMIE’s differentials and management plans against the primary care physicians’. There was no statistically significant difference for the differential diagnosis (p = 0.6) or for the appropriateness and safety of the management plan (p = 0.1 and p = 1.0, respectively). But the primary care physicians (PCPs) were rated significantly better on the practicality (p = 0.003) and cost-effectiveness (p = 0.004) of their plans. Two caveats matter. First, AMIE’s differentials were truncated to the same length as the physicians’ before rating, because AMIE tended to produce longer lists that could reveal which was the AI, so this was not a like-for-like contest. Second, the authors note the comparison “favored physicians who had more context,” including the AMIE transcript itself, an EHR, and a physical exam AMIE never had.

Patients’ attitudes toward AI improved significantly after the encounter (p < 0.001), and clinicians reported the transcripts were useful for visit prep.

The authors are explicit about the limits: a single academic center, a text-only interface, no controlled comparison arm, and a small sample. This is a feasibility signal, not evidence of clinical benefit — and, as a preprint, it has not been peer reviewed. What it establishes is narrower than a diagnostic win: a diagnostic LLM can be run with live patients under physician oversight without any consultation having to be stopped.

Philips clears Elevate Plus, moving Koios breast and thyroid AI onto the ultrasound cart

Owen Tanaka, Digital Health & AI Desk — Sun, 07 Jun 2026 00:00:00 +0000

Philips said on June 2 that the FDA granted 510(k) clearance for Elevate Plus, a software upgrade for its EPIQ Elite and Affiniti general-imaging ultrasound systems. The shift is less about pixels than about where the AI lives: the company is moving its Koios decision-support engine onto the scanner itself, so the read happens at the cart rather than on a separate workstation.

Two features anchor the release. Auto Measure Abdomen automates routine abdominal measurement steps, and Philips says it delivers “over 93% accuracy compared to manual measurements by clinical experts.” And Koios AI, now available on-cart, brings decision support to breast and thyroid imaging. Per the company, the breast tool, Koios Bi-RADS, “offers interpretation and assessment of the risk of malignancy in under 2 seconds”; the thyroid tool, Koios Ti-RADS, is described as supporting “confident lesion classifications using over 350,000 pathology-proven cases.” (BI-RADS and TI-RADS are the standard radiology scoring systems for breast and thyroid imaging that those tool names reference.) The under-two-second figure is attached only to the breast pathway in Philips’ own wording, not to thyroid. The upgrade also bundles imaging refinements branded XRes Pro+ and Super Res MVI Pro.

The numbers carry caveats

The 93% accuracy figure is not unsupported, but the evidence behind it is thin. A footnote in the release says the number was “obtained from a retrospective data analysis study involving data from 150 subjects (using MD.AI annotation tool, 3 clinical experts).” That is a small, internal, retrospective comparison against three expert annotators — not a peer-reviewed or prospective trial.

The most quotable figure — an up-to-30% reduction in scanning time — rests on even less. It comes from a testimonial attributed to Gretchen Sammy, an ultrasound manager at Boston Medical Center, in the company’s release: “Automating key measurement tasks allows our sonographers to reduce scanning time by up to 30% without sacrificing clinical precision.” That is a single-site customer statement about automating measurement tasks in general — the release does not tie the 30% to any one named feature — and it is not a controlled study. Philips has posted no trial behind it.

A 510(k) clearance establishes substantial equivalence to a predicate device; it is not evidence that the AI improves patient outcomes.

For now, Elevate Plus is a workflow-and-triage story: faster measurements and on-cart decision support meant to standardize how sonographers flag breast and thyroid findings. The release makes no patient-outcome or clinical-benefit claim. Whether the tool translates into fewer missed lesions or shorter reporting cycles is a question the marketing materials cannot answer.

A leaderboard-listed LLM still botches potassium dosing in a preprint stress-test — while claiming full confidence

Owen Tanaka, Digital Health & AI Desk — Sat, 06 Jun 2026 00:00:00 +0000

Potassium chloride is one of the drugs used in lethal injection. A few milliequivalents the wrong way, delivered too fast, can stop a heart. That is the unforgiving margin a new medRxiv preprint used to probe whether a leaderboard-listed large language model — one that features on the MedAgentBench benchmark, though not at the top of it — can safely handle a task that floods every acute-care unit: electrolyte replacement.

The answer, for now, is no — and the model does not seem to know it.

A team with Andrea Sikora (University of Colorado School of Medicine) as senior author built 20 clinician-annotated hypokalemia cases reflecting real-world complexity, well beyond the single-rule potassium task in the MedAgentBench benchmark. They tested GPT-5-Chat on each case in triplicate, with and without a clinician-curated dosing guideline, scoring six dimensions: potassium goals, dose, route, lab frequency, concurrent interventions, and the model’s own confidence and rating of case complexity.

The guideline helped, but not enough

Handed the dosing guideline, GPT-5-Chat’s average accuracy rose from 45% to 65%, and total errors fell from 165 to 104. Concurrent interventions and dosing drew the most errors in both arms. Potential-harm scores stayed “considerable” throughout, though severity eased when the guidance document was supplied.

The unsettling part is metacognition. GPT-5-Chat reported high confidence on 100% of responses — including the wrong ones — while flagging 80% of cases as highly complex with the guideline and 76% without it. It recognized difficulty and asserted certainty anyway.
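The mismatch between stated confidence and measured accuracy can be put in rough numbers. This is a back-of-the-envelope sketch, not the preprint's method: the function names are hypothetical, and treating "high confidence on 100% of responses" as directly comparable with average accuracy is a simplifying assumption, since a confidence label is not a calibrated probability.

```python
def percentage_point_gap(stated_pct, observed_pct):
    """Gap between the share of high-confidence answers and the measured accuracy."""
    return stated_pct - observed_pct


def relative_reduction_pct(before, after):
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return 100 * (before - after) / before


HIGH_CONFIDENCE_PCT = 100  # share of responses given with high confidence, as quoted
ACCURACY_PCT = {"without guideline": 45, "with guideline": 65}
TOTAL_ERRORS = {"without guideline": 165, "with guideline": 104}

confidence_gap = {
    arm: percentage_point_gap(HIGH_CONFIDENCE_PCT, accuracy)
    for arm, accuracy in ACCURACY_PCT.items()
}
error_drop_pct = round(
    relative_reduction_pct(
        TOTAL_ERRORS["without guideline"], TOTAL_ERRORS["with guideline"]
    ),
    1,
)

for arm, gap in confidence_gap.items():
    print(f"{arm}: confidence exceeds accuracy by {gap} points")
print(f"errors fell by {error_drop_pct}% once the guideline was supplied")
```

On this rough reading, the guideline narrows the gap between confidence and accuracy but does not close it.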

Accuracy topped out at 65% with the rulebook in hand — yet the model voiced high confidence on every single answer.

For grounding, 54 clinicians reviewed the cases; they “highly” or “somewhat” agreed with the guideline-recommended management only 66.8% of the time, underscoring genuine practice variability.

The authors’ conclusion is a warning to benchmark-builders: single-rule leaderboards like the MedAgentBench potassium item overstate readiness. This is a preprint, not yet peer reviewed, and it tests one model on one electrolyte — but the safety signal is clear.

Correction (6 June 2026): An earlier headline and lede called GPT-5-Chat “leaderboard-topping.” The preprint describes it as a model that appears on the MedAgentBench leaderboard, not one that tops it — that benchmark is in fact led by other models. The wording has been changed to “leaderboard-listed.” Flagged by The Vital Record’s independent verification pass.

Artera sells a metastatic prostate-cancer mortality estimate on a validation it has not disclosed

Owen Tanaka, Digital Health & AI Desk — Fri, 05 Jun 2026 00:00:00 +0000

Artera and its commercial partner Tempus have launched a version of the ArteraAI Prostate test for men with metastatic hormone-sensitive prostate cancer (mHSPC). Artera bills it as “the first digital pathology-based prognostic test designed to help inform treatment planning” for that setting; Tempus uses narrower language, calling it the “first prostate digital pathology algorithm in the Tempus ecosystem available for clinical use” and the first externally developed one in that ecosystem. The test reads a digitized biopsy slide alongside clinical variables through the company’s multimodal AI (MMAI) model and returns, per Artera, “a patient-specific estimate of 5-year prostate cancer-specific mortality (PCSM) in patients treated with androgen deprivation therapy (ADT) plus an androgen receptor pathway inhibitor (ARPI).”

The verifiable problem is what supports that specific number.

The marketed claim cites a validation the company does not disclose

For the marketed setting, Artera’s own release says the model “was further validated in patients with mHSPC receiving ADT plus ARPI as significantly prognostic for prostate cancer-specific mortality.” But none of the materials announcing the launch — Artera’s release or Tempus’s — name that validation’s trial or cohort, give its sample size, report a hazard ratio or confidence interval, or point to a peer-reviewed publication. A reader is asked to accept a PCSM estimate in ADT-plus-ARPI patients on the strength of a study whose existence is asserted but whose data and citation appear in none of these sources. Under a primary-source standard, that claim is unverifiable as published.

The peer-reviewed validation measured a different endpoint, in a different regimen

The one validation that is peer-reviewed and traceable is a separate analysis, published in European Urology Oncology, built on the phase 3 CHAARTED trial (NCT00309985). It does not match the marketed claim on two counts.

First, the endpoint. CHAARTED’s primary outcome was overall survival (OS), defined in the paper as “the time from randomization until death from any cause” — not the prostate cancer-specific mortality the product reports. Of 790 patients enrolled, 586 (74.2%) had evaluable digital pathology and 456 had the clinical data needed to generate MMAI scores. In that cohort the continuous MMAI score was associated with OS, the primary endpoint, with a hazard ratio of 1.51 per standard deviation (95% CI 1.33–1.73; p<0.001). Estimated 5-year overall survival was 83% in the MMAI-low group, 58% intermediate, and 39% high. The secondary outcomes were clinical progression and castration-resistant disease; prostate cancer-specific mortality was not reported as an outcome.
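Two of the numbers above reward a quick check: how much of the enrolled cohort actually fed the analysis, and what "per standard deviation" implies. The sketch below uses hypothetical function names. The two-standard-deviation figure is an illustration of how a per-SD hazard ratio scales, assuming the usual log-linear Cox relationship holds across the score range; it is not a result reported in the paper.

```python
ENROLLED = 790
EVALUABLE_PATHOLOGY = 586
WITH_MMAI_SCORE = 456
HR_PER_SD = 1.51  # hazard ratio for overall survival per 1 SD of MMAI score, as quoted


def pct_of(part, whole):
    """Share of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


def hazard_ratio_for_shift(hr_per_sd, n_sd):
    """Implied hazard ratio for a score difference of n_sd standard deviations.

    Assumes a log-linear (Cox-style) relationship, so effects multiply per SD.
    """
    return hr_per_sd ** n_sd


evaluable_pct = pct_of(EVALUABLE_PATHOLOGY, ENROLLED)
scored_pct = pct_of(WITH_MMAI_SCORE, ENROLLED)
hr_two_sd = round(hazard_ratio_for_shift(HR_PER_SD, 2), 2)

print(f"evaluable pathology: {evaluable_pct}% of enrolled")
print(f"MMAI score generated: {scored_pct}% of enrolled")
print(f"implied HR for a 2-SD higher score (illustrative): {hr_two_sd}")
```

The funnel matters for interpretation: the prognostic estimate describes the subset of men with usable slides and clinical data, not all 790 enrolled.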

Second, the regimen. CHAARTED randomized men to ADT alone versus ADT plus docetaxel chemotherapy. There was no ARPI arm. The setting Artera markets — ADT plus an ARPI — is not the regimen in which this published model was tested.

Prognostic, not predictive

In the CHAARTED analysis, the data support sorting risk rather than guiding therapy. The authors reported “no statistically significant interaction between treatment and either continuous or categorical MMAI risk scores,” called the analysis underpowered, and stated that the model “did not predict for docetaxel benefit.” The score identifies who is likely to do worse; in this dataset it did not show which men benefit more from adding chemotherapy.

The peer-reviewed validation measured all-cause survival in a docetaxel trial. The cause-specific mortality, ADT-plus-ARPI claim the test is sold on rests on a separate validation for which the company names no trial, number, or citation.

The mHSPC test is offered as a laboratory-developed test, distinct from Artera’s localized-disease indication authorized by FDA via De Novo (DEN240068), which covers a 10-year distant-metastasis and prostate cancer-specific mortality risk estimate in treatment-naive, non-metastatic patients. Whether a mortality estimate shifts treatment in a metastatic setting where intensification is already standard remains an open clinical question — and one that the undisclosed ADT-plus-ARPI validation, if published with its endpoint, comparator, sample size, and effect size, would help answer. This is journalism, not medical advice.

A specialist neurosurgery AI couldn't beat GPT-4o, and surgeons rarely used either

Owen Tanaka, Digital Health & AI Desk — Wed, 03 Jun 2026 00:00:00 +0000

The pitch for specialized medical AI is intuitive: a model trained only on curated, peer-reviewed literature should outperform a general-purpose chatbot trained on the open internet. A blinded randomized trial published in Neurosurgery tested that premise head-on, and the result is a useful cold shower for the domain-specific thesis.

Researchers at NYU Langone Health built CNS-Obsidian, a vision-language model fine-tuned from a 34-billion-parameter open model on 23,984 neurosurgical journal articles, which yielded 78,853 figures and captions and 263,064 training samples. They then ran it against a HIPAA-compliant GPT-4o endpoint as a diagnostic copilot, with neurosurgeons blinded and randomized to one model or the other after patient consultations between August and November 2024.

The headline finding: the specialist did not win. On the trial’s primary endpoints, CNS-Obsidian drew positive helpfulness ratings in 40.62% of cases versus 57.89% for GPT-4o (P = .230), and both models included the correct diagnosis in roughly 60% of cases (59.38% vs 65.79%, P = .626). Neither difference was statistically significant, but neither favored the home-grown model.

The interface, not the weights, may be the bottleneck

The more striking number is engagement. Of 959 total consultations during the trial window, clinicians invoked the copilot in just 70 — a 7.3% utilization rate — leaving only 32 CNS-Obsidian and 38 GPT-4o cases to evaluate. A tool that surgeons reach for in fewer than one in thirteen encounters is not yet part of the workflow, whichever model sits behind it.
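The engagement arithmetic is simple enough to verify directly. A minimal sketch, with hypothetical variable names and the counts quoted above:

```python
TOTAL_CONSULTATIONS = 959
CNS_OBSIDIAN_CASES = 32
GPT4O_CASES = 38

# Every copilot invocation was randomized to one of the two models.
copilot_invocations = CNS_OBSIDIAN_CASES + GPT4O_CASES

utilization_pct = round(100 * copilot_invocations / TOTAL_CONSULTATIONS, 1)
consultations_per_use = round(TOTAL_CONSULTATIONS / copilot_invocations, 1)

print(f"copilot used in {copilot_invocations} of {TOTAL_CONSULTATIONS} consultations")
print(f"utilization: {utilization_pct}%")
print(f"roughly one use per {consultations_per_use} consultations")
```

With only 32 and 38 cases per arm, the comparison between models is small as well as rare, which helps explain why differences of more than 17 percentage points in helpfulness ratings did not reach statistical significance.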

Low clinical utilization suggests chatbot interfaces may not align with specialist workflows.

The benchmark data complicate the story further. CNS-Obsidian essentially matched GPT-4o on synthetic, model-generated questions (76.13% vs 77.54%, P = .235) but collapsed on human-written ones (46.81% vs 65.70%, P < 10⁻¹⁵) — a gap suggesting the specialist learned to answer questions shaped like its own training data, not the messier ones clinicians actually ask.

The authors’ framing is measured: a far smaller, cheaper model can approach frontier performance in a narrow domain, and the training pipeline offers a transparent template for other specialties. That is a real contribution. But approaching frontier performance is not the same as demonstrating clinical impact, and on this evidence the case for swapping a frontier model out for a bespoke one, along with the deeper case that a chatbot is the right interface at all, remains unproven.

Philips puts AI auto-measurement and on-cart Koios on its general-imaging ultrasound

Owen Tanaka, Digital Health & AI Desk — Wed, 03 Jun 2026 00:00:00 +0000

Philips said it has received FDA 510(k) clearance for Elevate Plus, a software upgrade that brings AI auto-measurement and on-cart decision support to its EPIQ Elite and Affiniti general-imaging ultrasound systems. The package also carries a CE Mark.

The headline addition is Auto Measure Abdomen, which automates routine abdominal measurement steps. Philips reports it delivers “over 93% accuracy compared to manual measurements by clinical experts” and frames the tool as a way to reduce operator variability. The second pillar is Koios AI Decision Support, which Philips has now moved on-cart after previously offering it only off-cart. Koios classifies breast lesions (BI-RADS) and thyroid nodules (TI-RADS); the company says the thyroid model leverages “more than 350,000 pathology-proven cases” and that its breast read returns in “under 2 seconds.” Two imaging enhancements, XRes Pro+ and Super Res MVI Pro, round out the release.

Read the numbers carefully

The supporting figures here come from the vendor, not from a published trial. Philips does disclose the basis for the 93% figure in a footnote: it was “obtained from a retrospective data analysis study involving data from 150 subjects,” with annotations made using the MD.AI tool and three clinical experts as the comparison standard. That is a stated study type and sample size — but it is a vendor-internal retrospective analysis (n=150), not peer-reviewed, and no confidence interval is reported. A separate “up to 30%” reduction in scanning time is attributed to a single customer — an ultrasound manager at Boston Medical Center — and is not drawn from any study.

A 510(k) clears a device as substantially equivalent to a predicate. It is not a finding of clinical superiority, and none of the supporting figures here are peer-reviewed.

For clinicians, the practical shift is consolidation: measurements and lesion classification that previously required separate steps or off-cart tools now sit inside the scanning workflow. Whether the reported accuracy and time savings hold up in routine practice remains to be demonstrated outside Philips’s own materials.