Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines

Research output: Contribution to journal › Research article › Contributed › peer-review

Contributors

  • M J Hetz, German Cancer Research Center (DKFZ) (Author)
  • N Carl, German Cancer Research Center (DKFZ), Universitätsmedizin Mannheim (Author)
  • S Haggenmüller, German Cancer Research Center (DKFZ) (Author)
  • C Wies, Heidelberg University, German Cancer Research Center (DKFZ) (Author)
  • J N Kather, Department of Internal Medicine I, Else Kröner Fresenius Center for Digital Health, National Center for Tumor Diseases (NCT) Heidelberg (Author)
  • M S Michel, Universitätsmedizin Mannheim (Author)
  • F Wessels, Universitätsmedizin Mannheim (Author)
  • T J Brinker, German Cancer Research Center (DKFZ) (Author)

Abstract

BACKGROUND: Large language models encode clinical knowledge and can answer medical expert questions out of the box without further training. However, this zero-shot performance is limited by outdated training data and a lack of explainability, both of which impede clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it against state-of-the-art models as well as historical urologists' performance in answering urological board questions in a fully clinician-verifiable manner.

MATERIALS AND METHODS: We developed UroBot, a software pipeline based on the GPT-3.5, GPT-4, and GPT-4o models by OpenAI, utilizing retrieval-augmented generation and the 2023 European Association of Urology guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation involved 10 runs with 200 European Board of Urology in-service assessment questions, with performance measured by the mean rate of correct answers (RoCA) across runs.
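
The mean RoCA described above can be illustrated with a minimal Python sketch. It is not taken from the UroBot code base: the function names, the answer-letter encoding, and the toy data are all assumptions made for illustration, and the sketch only shows the arithmetic of averaging per-run accuracy.

```python
from statistics import mean


def rate_of_correct_answers(predicted, answer_key):
    """Fraction of questions answered correctly in a single run."""
    if len(predicted) != len(answer_key):
        raise ValueError("each run must answer every question exactly once")
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


def mean_roca(runs, answer_key):
    """Mean RoCA: per-run accuracy averaged over all runs."""
    return mean(rate_of_correct_answers(run, answer_key) for run in runs)


# Toy data (hypothetical): 4 questions and 3 runs instead of 200 and 10.
answer_key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],  # 3 of 4 correct
    ["A", "C", "D", "D"],  # 3 of 4 correct
    ["A", "C", "B", "D"],  # 4 of 4 correct
]

print(f"mean RoCA = {mean_roca(runs, answer_key):.3f}")
```

In the study setting, `runs` would hold 10 lists of 200 answers each; the averaging step is unchanged.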

RESULTS: UroBot-4o achieved the highest RoCA, with an average of 88.4%, outperforming GPT-4o (77.6%) by 10.8 percentage points. It is also clinician-verifiable and demonstrated the highest level of agreement between runs as measured by Fleiss' kappa (κ = 0.979). In comparison, the average performance of urologists on urological board questions reported in the literature is 68.7%.
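
For readers unfamiliar with Fleiss' kappa, the sketch below shows one way to compute it in plain Python. It assumes that each question is treated as a subject and each run as a rater, which is an interpretation of "agreement between runs" and is not confirmed by the abstract; the function name and toy data are likewise hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of subjects, each rated by the same number of raters.

    ratings[i] is the list of labels assigned to subject i
    (here assumed: the answers given to question i across runs).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = {label for row in ratings for label in row}

    # Observed agreement: mean over subjects of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    total = n_subjects * n_raters
    p_e = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (hypothetical): 4 questions, 3 runs each.
toy = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "A", "A"],
    ["A", "B", "B"],
]
print(f"kappa = {fleiss_kappa(toy):.3f}")
```

A value close to 1, such as the reported κ = 0.979, indicates that the runs gave nearly identical answers to each question.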

CONCLUSIONS: UroBot is a clinician-verifiable and accurate software pipeline and outperforms published models and urologists in answering urology board questions. We provide code and instructions to use and extend UroBot for further development.

Details

Original language: English
Article number: 100078
Journal: ESMO Real World Data and Digital Oncology
Volume: 6
Publication status: Published - Dec 2024
Peer-reviewed: Yes

External IDs

PubMedCentral: PMC12836625
ORCID: /0000-0002-3730-5348/work/211722522

Keywords