Education Research: Can Large Language Models Match MS Specialist Training?: A Comparative Study of AI and Student Responses to Support Neurology Education
Research output: Contribution to journal › Research article › Contributed › peer-review
Abstract
BACKGROUND AND OBJECTIVES: Artificial intelligence (AI), particularly large language models (LLMs), is increasingly explored for clinical decision support and medical education. While general LLM proficiency on broad medical examinations has been demonstrated, their application of domain-specific knowledge in neurology remains underexplored. This study addresses that gap using multiple sclerosis (MS) as an exemplar, evaluating how LLM information access strategies affect accuracy in a specialized postgraduate curriculum and exploring possible roles of LLMs in neurology education.
METHODS: A comparative evaluation was conducted using 53 multiple-choice questions (MCQs) and 21 open-ended questions drawn from an MS curriculum used in a postgraduate MS program. Results from postgraduate students, primarily neurologists and neurology trainees, served as the reference. Each question was answered by 3 LLMs: GPT-4o (general-purpose), MS RAG (retrieval-augmented, accessing MS literature), and Prof. Valmed (CE-certified, domain-specific, trained on medical data). All models operated in zero-shot mode without previous exposure to the items. Questions were stratified by difficulty based on students' performance. Accuracy was compared using χ² tests.
RESULTS: Among the LLMs, GPT-4o reached 81.1% accuracy, MS RAG 86.8%, and Prof. Valmed 91.3%, while the reference student cohort (n = 28) achieved a mean of 82% (SD 23%). Although overall differences were not statistically significant (χ²(2) = 2.165, p = 0.339, Cramér's V = 0.119), performance varied by question type and difficulty. For MCQs with a single correct answer, domain-specific LLMs outperformed GPT-4o, although differences remained nonsignificant. By contrast, students showed stronger performance on single-wrong-answer formats. Stratified by difficulty, students outperformed LLMs on "easy" questions, while LLMs tended to achieve higher accuracy on "medium" and "hard" items. For open-ended questions, students reached 77.8% accuracy, while GPT-4o, MS RAG, and Prof. Valmed scored between 66.7% and 85.0%.
DISCUSSION: These findings indicate that while LLMs can perform at levels broadly comparable to postgraduate students, they may be particularly useful on more difficult tasks, where their consistency may complement human reasoning in a neurology subspecialty curriculum. Although the results should be interpreted cautiously given the limited sample size, this study illustrates possible roles for LLMs in neurology education, for example as AI tutors for complex topics, as support for formative assessments, or as targeted review resources. Further research should assess integration into educational workflows and decision support.
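The METHODS and RESULTS sections describe comparing model accuracies on a fixed item set with a χ² test and reporting effect size as Cramér's V. The sketch below shows, under stated assumptions, how such a comparison can be run; it is not the authors' code, and the correct/incorrect counts are illustrative placeholders rather than the published data (the study reports only percentages). For a 3 × 2 table, Cramér's V reduces to √(χ²/n), which is consistent with the reported χ²(2) = 2.165 and V = 0.119 when n is the total number of model responses.

```python
# Illustrative sketch (not the study's code or data): chi-squared test of
# independence across three models' accuracy on the same item set, plus
# Cramér's V as the effect size. Counts are placeholders for illustration.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: models; columns: [correct, incorrect] out of 53 MCQs (placeholder counts).
table = np.array([
    [43, 10],  # e.g., a general-purpose model
    [46,  7],  # e.g., a retrieval-augmented model
    [48,  5],  # e.g., a domain-specific model
])

chi2, p, dof, _expected = chi2_contingency(table)

# Cramér's V for an r x c table: sqrt(chi2 / (n * min(r-1, c-1)))
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2({dof}) = {chi2:.3f}, p = {p:.3f}, Cramér's V = {v:.3f}")
```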
Details
| Original language | English |
|---|---|
| Pages (from-to) | e200260 |
| Journal | Neurology. Education |
| Volume | 4 |
| Issue number | 4 |
| Publication status | Published - 20 Nov 2025 |
| Peer-reviewed | Yes |
External IDs
| PubMedCentral | PMC12636769 |
|---|---|
| ORCID | /0000-0001-8799-8202/work/198593606 |
| ORCID | /0000-0002-3730-5348/work/198594717 |
| ORCID | /0000-0002-1997-1689/work/198594741 |