ChatGPT Passes Board Exam

— Here's how two versions of the chatbot performed on mock radiology boards

by Michael DePeau-Wilson, Enterprise & Investigative Writer, MedPage Today, May 16, 2023

OpenAI's ChatGPT has passed another medical exam -- this time achieving a passing score on a radiology board-style test, according to two new studies.

In assessments of both the GPT-3.5 version and GPT-4 version of ChatGPT, the AI chatbot improved from a near-passing score of 69.3% to a passing score of 80.7% on a 150-question radiology board-style examination, Rajesh Bhayana, MD, of University Medical Imaging Toronto in Canada, and colleagues reported in Radiology.

Specifically, the GPT-4 version outperformed the GPT-3.5 version on higher-order thinking questions (81% vs 60%, P=0.002), especially those involving descriptions of imaging findings (85% vs 61%, P=0.009) and applying medical concepts (90% vs 30%, P=0.006), they reported.

Notably, the newer version of the technology showed no improvement over the GPT-3.5 version on lower-order questions (80% vs 84%, P=0.64), they said.

"The improvement in higher-order reasoning, which suggests an improvement in understanding the contextual language in radiology, but also in medicine in general, does suggest that we're closer to ... downstream applications," Bhayana told MedPage Today.

The questions were text-based and multiple choice, and the researchers divided them into two broad categories (higher- and lower-order) to assess the specific strengths and weaknesses of the technology in providing answers. The questions were also selected to match the style, content, and difficulty of both the Canadian Royal College and American Board of Radiology examinations.

The investigators defined lower-order thinking questions as those that focused on knowledge recall and basic understanding. They defined the higher-order thinking questions as those focused on applications of knowledge and analyzing or synthesizing information.

Research that tests AI models on medical credentialing exams has become commonplace since ChatGPT's release on November 30, 2022, and passing medical board-style exams has been a long-held goal for AI developers, notably the team behind Google's medical-focused large language model (LLM), Med-PaLM.

Those AI testing efforts had a breakthrough moment in December 2022 when researchers showed that Med-PaLM achieved 67.6% accuracy, in line with a common threshold for passing scores, on the U.S. Medical Licensing Examination (USMLE). It was a major milestone in proving the capabilities of this technology in medicine -- similar to AI's ongoing competition with chess grandmasters through the years.

Then in March 2023, Google announced that an updated version of its LLM, called Med-PaLM 2, performed at "expert" physician levels on a series of practice USMLE questions while achieving 85% accuracy -- an improvement of 18 percentage points in less than 3 months.

In addition to those top marks, ChatGPT was recently assessed on its ability to answer patient-generated questions. In a blinded comparison with real physician answers, evaluators preferred ChatGPT's responses more than 75% of the time. The AI chatbot's answers were also rated as being significantly more empathetic than physicians' answers.

In the big picture, the efforts to prove what AI models can and can't do now are an exercise in setting benchmarks, Bhayana said.

He noted that the current focus is on determining how ChatGPT and other AI models can be used in medicine, but cautioned that the applications are limited due to the technologies' tendencies to "hallucinate" or lie -- often quite confidently. However, he hopes the technology will continue to improve to allow for broader applications in medicine.

"It's possible that we can get to a very, very high accuracy," he said. "Then it could be relied on more in clinical practice, but it's also possible that the technology has a threshold."

The goal is to learn how much physicians can trust these tools, then start to work on improving and optimizing the models for specific clinical uses. At the moment, generative AI has been shown to be efficient in certain uses, such as dictation and transcribing, but Bhayana believes it will take more time and updates before physicians will be able to trust these tools in more high-stakes clinical situations.

"As these tools come out, [the key will be] understanding how they perform, looking for applications of them, and then making sure that people are informed as to what their strengths and limitations are, so [physicians] can grow with the technology," Bhayana said.

Correction: An earlier headline on this story stated this was the first board exam ChatGPT has passed, but the chatbot has also passed the neurosurgery board exam.

Michael DePeau-Wilson is a reporter on MedPage Today’s enterprise & investigative team. He covers psychiatry, long COVID, and infectious diseases, among other relevant U.S. clinical news.

Disclosures

Bhayana and coauthors reported no relevant conflicts of interest.

Primary Source

Radiology

Source Reference: Bhayana R, et al "Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations" Radiology 2023; DOI:10.1148/radiol.230582.

Secondary Source

Radiology

Source Reference: Bhayana R, et al "GPT-4 in radiology: Improvements in advanced reasoning" Radiology 2023; DOI: 10.1148/radiol.230987.