Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines
Publication: Contribution to journal › Research article › Contributed › Peer-reviewed
Contributors
Abstract
BACKGROUND: Large language models encode clinical knowledge and can answer medical expert questions out of the box without further training. However, this zero-shot performance is limited by outdated training data and a lack of explainability, both of which impede clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it against state-of-the-art models as well as historical urologists' performance in answering urological board questions in a fully clinician-verifiable manner.
MATERIALS AND METHODS: We developed UroBot, a software pipeline based on the GPT-3.5, GPT-4, and GPT-4o models by OpenAI, utilizing retrieval-augmented generation and the 2023 European Association of Urology guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation comprised 10 runs on 200 European Board of Urology in-service assessment questions, with performance measured by the mean rate of correct answers (RoCA).
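The mean RoCA described above can be illustrated with a minimal sketch. It assumes that each run yields one predicted answer option per question and that RoCA for a run is the fraction of predictions matching the answer key; the function names `rate_of_correct_answers` and `mean_roca` and the toy data are hypothetical and are not taken from the UroBot code.

```python
from statistics import mean


def rate_of_correct_answers(predicted, answer_key):
    """Fraction of questions answered correctly in a single run."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


def mean_roca(runs, answer_key):
    """Average the per-run rate of correct answers over all runs."""
    return mean(rate_of_correct_answers(run, answer_key) for run in runs)


# Toy data (assumption): 4 questions and 3 runs instead of 200 and 10.
answer_key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],  # 3 of 4 correct
    ["A", "C", "B", "D"],  # 4 of 4 correct
    ["A", "B", "B", "D"],  # 3 of 4 correct
]

print(f"mean RoCA = {mean_roca(runs, answer_key):.3f}")
```

Averaging per-run rates rather than pooling all answers gives the same value here because every run covers the same questions; the per-run form also makes it easy to report the spread across runs.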
RESULTS: UroBot-4o achieved the highest RoCA, with an average of 88.4%, outperforming GPT-4o (77.6%) by 10.8 percentage points. It is also clinician-verifiable and demonstrated the highest level of agreement between runs as measured by Fleiss' kappa (κ = 0.979). In comparison, the average performance of urologists on urological board questions reported in the literature is 68.7%.
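As a rough illustration of how agreement between runs can be quantified, the sketch below computes Fleiss' kappa in plain Python. It assumes that each run is treated as one rater and each answer option as one category; the abstract does not state how the ratings were encoded, so this setup, the function name `fleiss_kappa`, and the toy data are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined, so this sketch
        # returns 1.0 by convention (assumption).
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy data (assumption): 4 questions, 3 runs, one answer option per run.
ratings = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["A", "A", "B"],
    ["C", "C", "C"],
]

print(f"kappa = {fleiss_kappa(ratings):.3f}")
```

A value close to 1, such as the reported κ = 0.979, indicates that the runs almost always gave the same answer to the same question.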
CONCLUSIONS: UroBot is a clinician-verifiable and accurate software pipeline and outperforms published models and urologists in answering urology board questions. We provide code and instructions to use and extend UroBot for further development.
Details
| Original language | English |
|---|---|
| Article number | 100078 |
| Journal | ESMO real world data and digital oncology |
| Volume | 6 |
| Publication status | Published - Dec 2024 |
| Peer-review status | Yes |
External IDs
| PubMedCentral | PMC12836625 |
|---|---|
| ORCID | /0000-0002-3730-5348/work/211722522 |