Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines

Publication: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

  • M J Hetz, Deutsches Krebsforschungszentrum (DKFZ) (Author)
  • N Carl, Deutsches Krebsforschungszentrum (DKFZ), Universitätsmedizin Mannheim (Author)
  • S Haggenmüller, Deutsches Krebsforschungszentrum (DKFZ) (Author)
  • C Wies, Universität Heidelberg, Deutsches Krebsforschungszentrum (DKFZ) (Author)
  • J N Kather, Medizinische Klinik und Poliklinik I, Else Kröner Fresenius Zentrum für Digitale Gesundheit, Nationales Zentrum für Tumorerkrankungen (NCT) Heidelberg (Author)
  • M S Michel, Universitätsmedizin Mannheim (Author)
  • F Wessels, Universitätsmedizin Mannheim (Author)
  • T J Brinker, Deutsches Krebsforschungszentrum (DKFZ) (Author)

Abstract

BACKGROUND: Large language models encode clinical knowledge and can answer medical expert questions out-of-the-box without further training. However, this zero-shot performance is limited by outdated training data and a lack of explainability, impeding clinical translation. We aimed to develop a urology-specialized chatbot (UroBot) and evaluate it against state-of-the-art models as well as historical urologists' performance in answering urological board questions in a fully clinician-verifiable manner.

MATERIALS AND METHODS: We developed UroBot, a software pipeline based on the GPT-3.5, GPT-4, and GPT-4o models by OpenAI, utilizing retrieval-augmented generation and the 2023 European Association of Urology guidelines. UroBot was benchmarked against the zero-shot performance of GPT-3.5, GPT-4, GPT-4o, and Uro_Chat. The evaluation involved 10 runs with 200 European Board of Urology in-service assessment questions, with performance measured by the mean rate of correct answers (RoCA).

RESULTS: UroBot-4o achieved the highest RoCA, with an average of 88.4%, outperforming GPT-4o (77.6%) by 10.8 percentage points. It is also clinician-verifiable and demonstrated the highest level of agreement between runs as measured by Fleiss' kappa (κ = 0.979). In comparison, the average performance of urologists on urological board questions is 68.7% as reported in the literature.

CONCLUSIONS: UroBot is a clinician-verifiable and accurate software pipeline and outperforms published models and urologists in answering urology board questions. We provide code and instructions to use and extend UroBot for further development.
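
The two evaluation metrics named in the abstract, the mean RoCA over repeated runs and Fleiss' kappa for agreement between runs, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `mean_roca` and `fleiss_kappa` and the toy answers are hypothetical, and the sketch assumes every run answers every question with a single option label, treating each run as a "rater" and each question as a "subject".

```python
from collections import Counter


def mean_roca(runs, answer_key):
    """Mean rate of correct answers across runs, as a fraction in [0, 1]."""
    per_run = [
        sum(pred == gold for pred, gold in zip(run, answer_key)) / len(answer_key)
        for run in runs
    ]
    return sum(per_run) / len(per_run)


def fleiss_kappa(runs):
    """Fleiss' kappa with runs as raters and questions as subjects."""
    n_raters = len(runs)
    n_items = len(runs[0])
    categories = sorted({answer for run in runs for answer in run})

    # counts[i][c] = number of runs that chose option c for question i
    counts = [Counter(run[i] for run in runs) for i in range(n_items)]

    # Observed agreement per question, then averaged over questions
    p_i = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall share of each answer option
    p_j = [sum(cnt[c] for cnt in counts) / (n_items * n_raters) for c in categories]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Degenerate case (assumption): one option chosen everywhere
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy data (hypothetical): 3 runs over 4 questions
    key = ["A", "B", "C", "D"]
    toy_runs = [
        ["A", "B", "C", "D"],
        ["A", "B", "C", "A"],
        ["A", "B", "D", "D"],
    ]
    print(f"mean RoCA: {mean_roca(toy_runs, key):.3f}")
    print(f"Fleiss' kappa: {fleiss_kappa(toy_runs):.3f}")
```

In the setting described above, `runs` would hold 10 lists of 200 answers each. A kappa close to 1 means the runs almost always select the same option for a given question, whether or not that option is correct.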

Details

Original language: English
Article number: 100078
Journal: ESMO Real World Data and Digital Oncology
Volume: 6
Publication status: Published - Dec 2024
Peer-reviewed: Yes

External IDs

PubMed Central: PMC12836625
ORCID /0000-0002-3730-5348/work/211722522

Keywords