ChatGPT Outperforms Humans on Mock Neurology Boards


ChatGPT significantly outperformed humans on a mock version of the American Board of Psychiatry and Neurology (ABPN) board exam.

ChatGPT-4, the latest version of OpenAI's large language model, outperformed the average human user on an ABPN-approved question bank, answering 85% of questions correctly versus a mean human score of 73.8%, according to Varun Venkataramani, MD, PhD, of Heidelberg University Hospital in Germany, and co-authors.

The older model, ChatGPT-3.5, answered only 66.8% of the questions correctly. "Both models used confident or very confident language, even when they were incorrect," Venkataramani and colleagues reported in JAMA Network Open.

The study is a good demonstration of the power and capabilities of large language models like ChatGPT, but its findings could be misinterpreted, noted Lyell Jones Jr., MD, of the Mayo Clinic in Rochester, Minnesota, who was not involved in the study.

"This paper shows that ChatGPT can answer multiple-choice questions correctly," Jones told MedPage Today. "It does not demonstrate that ChatGPT can practice clinical medicine or serve as a substitute for clinical decision-making."

"Tests, including multiple-choice tests, are tools designed to assess medical knowledge, which is only one domain of competency required to practice medicine," Jones continued. "Transformer technologies like those used in ChatGPT can predict text, but they do not conduct interviews, perform physical examinations, generate assessments and plans, interpret clinical data, or communicate results."

He added that while it was a remarkable technical achievement for the program to answer so many questions correctly, "the error rate was still high, and the tendency to express certainty while being incorrect represents an additional risk or caution in using large language model tools."

Transformer-based natural language processing tools such as ChatGPT may enhance clinical neurology care, but they come with limitations and risks, including fabricated facts. Those risks and benefits were addressed in a recent paper in Neurology, which showed that ChatGPT provided potentially dangerous advice to a young woman with epilepsy who wanted to become pregnant. Research involving medical questions in other specialties has indicated that, despite improvements, neither ChatGPT-3.5 nor ChatGPT-4 should be relied upon as a sole source of medical knowledge.

In their study, Venkataramani and co-authors used an ABPN-approved question bank and classified the questions as lower order or higher order based on Bloom's taxonomy. Lower-order questions assessed recall and basic understanding, while higher-order questions required applying, analyzing, or evaluating information.

The 2,036-question bank was similar to a neurology board exam and was part of a self-assessment program that can be used for continuing medical education (CME) credit; a score of 70% was the threshold for earning CME credit. The researchers excluded 80 questions that contained videos or images or that built on previous questions, leaving 1,956 questions in the study.

Both large language models were housed on a server and trained on more than 45 terabytes of text data from websites, books, and articles. Neither model had the ability to search the internet.

ChatGPT-3.5 matched human users on lower-order questions but lagged behind on higher-order questions. ChatGPT-4 outperformed humans on both lower- and higher-order questions. ChatGPT-4 also performed better on questions in behavioral, cognitive, and psychological categories (89.8% correct) than on questions about epilepsy and seizures (70.9%) or neuromuscular topics (78.8%).

On a 5-point Likert scale, both models consistently rated their confidence in their answers as confident or very confident, regardless of whether their answers were correct. When confronted with the correct answer after answering incorrectly, both models apologized and agreed with the provided answer in all cases.
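For readers curious about the mechanics, the sketch below shows one way such a confidence-elicitation exchange could be scripted against a chat model using the OpenAI Python SDK. It is a minimal illustration under assumptions: the sample question, prompt wording, and the `ask_with_confidence` helper are hypothetical and do not reproduce the study's actual protocol.

```python
# Minimal sketch of a two-turn exchange: pose a board-style
# multiple-choice question, then ask the model to rate its own
# confidence on a 5-point Likert scale. Prompt wording is illustrative,
# not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical sample question, not drawn from the study's question bank.
QUESTION = (
    "A 62-year-old man presents with resting tremor and bradykinesia. "
    "What is the most likely diagnosis?\n"
    "A) Essential tremor\n"
    "B) Parkinson disease\n"
    "C) Huntington disease\n"
    "D) Multiple sclerosis"
)

def ask_with_confidence(question: str, model: str = "gpt-4") -> tuple[str, str]:
    """Return the model's answer and its self-rated confidence (1-5)."""
    messages = [
        {"role": "user", "content": question + "\n\nAnswer with a single letter."}
    ]
    first = client.chat.completions.create(model=model, messages=messages)
    answer = first.choices[0].message.content

    # Feed the model's own answer back and ask for a Likert-scale rating.
    messages.append({"role": "assistant", "content": answer})
    messages.append({
        "role": "user",
        "content": (
            "On a Likert scale from 1 (very unconfident) to "
            "5 (very confident), how confident are you in that answer?"
        ),
    })
    second = client.chat.completions.create(model=model, messages=messages)
    return answer, second.choices[0].message.content

if __name__ == "__main__":
    answer, confidence = ask_with_confidence(QUESTION)
    print("Answer:", answer)
    print("Confidence:", confidence)
```

As the study notes, self-reported ratings gathered this way tend to cluster at the confident end of the scale whether or not the answer is correct, which is why they cannot substitute for independent verification.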

One limitation of the study is that official ABPN board exam questions could not be used because of their confidential and proprietary nature, Venkataramani and co-authors said. In addition, the passing score was an approximation based on the ABPN's threshold for CME credit.

It's unclear what clinical or educational benefit these findings have, Jones noted. "It's a great tech demo, but do we need software that can take tests designed for humans?" he asked.

"The most interesting study in this context would be to use ChatGPT to generate high-quality multiple-choice questions, learning cases, or other educational materials," he suggested. "In any case, error rates are high enough that any application of transformer technology in clinical or educational settings requires careful human verification and fact-checking."

  • Judy George covers neurology and neuroscience news for MedPage Today, writing about brain aging, Alzheimer's disease, dementia, multiple sclerosis, rare diseases, epilepsy, autism, headache, stroke, Parkinson's disease, amyotrophic lateral sclerosis, concussion, chronic traumatic encephalopathy, sleep, pain, and more.

Disclosures

Venkataramani had no disclosures. One author reported a patent for agents for the treatment of glioma.

Jones has received publishing royalties from a health care publication, has unpaid relationships as a member of the boards of directors of the Mayo Clinic Accountable Care Organization and the American Academy of Neurology Institute, and has received personal compensation for serving as an editor for the American Academy of Neurology.

Primary source

JAMA Network Open

Source Reference: Schubert MC, et al "Performance of large language models on a neurology board-style examination" JAMA Netw Open 2023; DOI: 10.1001/jamanetworkopen.2023.46721.
