In the fall of 2024, JAMA Network published the results of a randomized clinical trial evaluating the effect of a type of artificial intelligence (AI), a large language model (LLM), on physicians’ diagnostic reasoning. Fifty physicians with a mean of 3 years in practice and training in a general medical specialty, including 26 attendings and 24 residents, participated in this single-blind trial. Physicians were presented with a series of case vignettes based upon actual patients, along with relevant clinical information (history, physical examination, initial lab results, etc.), and were asked to provide 3 possible diagnoses and their reasoning, as well as a final diagnostic decision. They were given 60 minutes to evaluate up to a maximum of 6 cases (which were neither overly complex nor simplistic), using any of their conventional resources, such as UpToDate, and they were graded on both the accuracy of their diagnoses and their reasoning. One-half of the clinicians were randomized to also use an LLM interface, in this case ChatGPT Plus (GPT-4), in addition to (not instead of) their conventional resources. The average number of completed cases per participant was 5. Physicians in the LLM group had a median score of 76%, while physicians using only conventional resources had a median score of 74%, a difference that was not significant. This suggests that access to an LLM did not improve clinical reasoning, although it did reduce the time spent per case (519 vs. 565 seconds).
However, there was a third comparison group: the LLM alone, with no physician involvement. The median score per case in this arm of the study was 92%, a significant improvement of 16 percentage points. The authors of this paper make it clear that this surprising result does not justify giving LLMs diagnostic autonomy, but it does challenge the assumption that access to LLMs, at least in their current state and as utilized in this study, improves clinical reasoning. Instead, it suggests that the appropriate use of LLMs in health care is not immediately obvious.
Other recent studies illustrate additional possible uses of AI in health care. For example, in March 2025 Nature Medicine published a comparison between certified electrocardiogram (ECG) technicians and an AI algorithm known as DeepRhythmAI. Data from an unselected population of nearly 15,000 individuals and over 211,000 days of ambulatory ECG monitoring were used, and over 2000 “critical arrhythmias” (e.g., atrial fibrillation, supraventricular tachycardia) were selected from this dataset by panels of cardiologists for comparison. The AI was found to have greater sensitivity for these critical arrhythmias, identifying 98.6% of them, versus 80.3% identified by the human technicians (each recording was read by one of 167 technicians). This translates to a false negative rate of 3.2 per 1000 patients vs. 44.3 per 1000, roughly a 14-fold difference. However, the AI also produced more false positives, at a rate of 6.3% vs. 2.3% for the human technicians. The authors of this paper concluded that AI could safely replace human technicians, given its superior sensitivity and only a modest increase in false positives.
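To make the size of these differences concrete, the figures quoted above can be checked with a bit of arithmetic. The following is a minimal illustrative sketch in Python, using only the rates reported in the study summary above; the variable names are ours, not the study’s.

```python
# Illustrative arithmetic using the rates quoted above (not the study's raw data).
fn_ai, fn_tech = 3.2, 44.3            # false negatives per 1000 patients
fold_difference = fn_tech / fn_ai     # 44.3 / 3.2 ≈ 13.8, i.e., roughly 14-fold
print(f"False-negative fold difference: {fold_difference:.1f}x")

fp_ai, fp_tech = 0.063, 0.023         # false positive rates (6.3% vs. 2.3%)
print(f"False positives: {fp_ai:.1%} (AI) vs. {fp_tech:.1%} (technicians), "
      f"about {fp_ai / fp_tech:.1f}x more")
```

In other words, the trade-off reported by the authors amounts to missing critical arrhythmias far less often at the cost of roughly three times as many false alarms.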
The first randomized controlled study evaluating the use of AI for mammography screening, known as the MASAI trial, appeared in The Lancet Digital Health. Over 100,000 women (median age 53.7 years) in Sweden were randomly assigned to AI-supported screening (Transpara) or to double reading by human radiologists (a standard practice in Europe, but not in the U.S.). The AI flagged readings as low, intermediate, or high risk, triaging each examination to subsequent single or double human reading. AI-supported screening led to a significant 29% increase in cancer detection compared to human double reading (338 vs. 262 total cancers). AI support also led to more recalls (call-backs for further evaluation) and a higher false positive rate (1.5% vs. 1.4%, or 772 vs. 765 total cases), but neither difference reached statistical significance. Overall, this study suggests that AI-supported screening, used as a triage step, may increase the rate of cancer detection and substantially reduce the workload of human readers. This may be especially important in the U.S., where only a single read is likely to occur.
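As a similar sanity check, the relative differences reported for MASAI follow directly from the raw counts. This is a minimal sketch using the totals quoted above, with variable names of our own choosing, and it assumes the two trial arms were of comparable size, as they were by design.

```python
# Illustrative arithmetic using the MASAI counts quoted above (arms of comparable size).
cancers_ai, cancers_double = 338, 262     # cancers detected in each arm
relative_increase = cancers_ai / cancers_double - 1
print(f"Relative increase in cancer detection: {relative_increase:.0%}")   # ≈ 29%

fp_ai, fp_double = 772, 765               # false-positive cases in each arm
print(f"False positives: {fp_ai} vs. {fp_double} "
      f"(a {fp_ai / fp_double - 1:.1%} relative difference)")
```

The false-positive difference of well under 1% helps explain why it did not reach statistical significance, while the 29% gap in cancer detection did.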
Thus, there are certainly specific settings in which the incorporation of AI into health care has a strong case, either as a stand-alone technology or as a tool that in some way augments human capabilities. The pace and breadth of AI’s growth are rapid, though. Many studies suggest that AI has specific limitations, most commonly assumed to involve the ability to dialogue with patients, perform meaningful history-taking, and the like. Yet a 2025 study found that an LLM optimized for diagnostic dialogue (Articulate Medical Intelligence Explorer, or AMIE) outperformed primary care physicians on a variety of benchmarks as judged by specialist physicians and patient actors, including history-taking, differential diagnostic accuracy, communication ability, and even empathy. This study was certainly limited, as it consisted entirely of online text-based consultations, but it raises the possibility that the role of AI in medicine may be quite broad indeed.