New study says performance boost with AI not uniform across individual radiologists

March 21, 2024
Even when applying the same AI tools to the same cases, radiologists' performance varied across individuals: some improved while others got worse. The finding suggests that individual clinician differences affect diagnostic accuracy even with AI assistance, and it highlights the need to personalize assistive AI systems.

That’s what researchers at Harvard Medical School, MIT, and Stanford found in a new study of how radiologists perform when using AI for image interpretation. Whereas previous studies evaluated the effects of AI on radiologists as a group, this one examined variability among individuals, looking at how factors such as specialty, years of practice, and prior experience with AI tools affect human-AI collaboration.

While the investigation did not show how or why radiologists performed differently from one another, it did, unsurprisingly, find that more accurate AI solutions boosted performance and that poorly performing ones diminished accuracy, emphasizing the need to test and validate AI before clinically deploying these models.

“Our research reveals the nuanced and complex nature of machine-human interaction,” said study co-senior author Nikhil Agarwal, professor of economics at MIT, in a statement. “It highlights the need to understand the multitude of factors involved in this interplay and how they influence the ultimate diagnosis and care of patients.”

In the study, the researchers recorded how well radiologists using AI identified 15 pathologies on chest X-rays in 327 patient cases. Advanced computational methods captured the magnitude of change in performance when using and not using AI.

Radiologists with low baseline performance did not benefit consistently from AI: some improved, others got worse, and some showed no change at all. Those who performed poorly without AI also tended to perform poorly with it, and the same pattern held for stronger performers. Years of experience, specialty, and prior use of an AI reader did not reliably predict how a radiologist would fare with AI assistance.

The researchers say this does not mean that AI should not be deployed in medical settings, but that pretesting should serve as a safeguard against adopting inferior models that can degrade performance and, in turn, patient care. Testing of radiologist-AI interaction should also take place in experimental settings that mimic real-world conditions and reflect the actual patient population for which the tools are designed.

They also encourage AI developers to work with the physicians who use their tools to better understand and incorporate findings about these interactions, and to design solutions that explain their decisions, which could help train radiologists to detect inaccurate results and question diagnostic calls made by AI.

The findings were published in Nature Medicine.