RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering

Publication: Contribution to journal, Research article, Contributed, Peer-reviewed

Contributors

  • Soroosh Tayebi Arasteh - Universitätsklinikum Aachen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Stanford University (Author)
  • Mahshad Lotfinia - Universitätsklinikum Aachen (Author)
  • Keno Bressem - Deutsches Herzzentrum München, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Robert Siepmann - Universitätsklinikum Aachen (Author)
  • Lisa Adams - Deutsches Herzzentrum München, Stanford University (Author)
  • Dyke Ferber - Else Kröner Fresenius Zentrum für Digitale Gesundheit, Nationales Zentrum für Tumorerkrankungen (NCT) Heidelberg (Author)
  • Christiane Kuhl - Universitätsklinikum Aachen (Author)
  • Jakob Nikolas Kather - Medizinische Klinik und Poliklinik I, Else Kröner Fresenius Zentrum für Digitale Gesundheit, Nationales Zentrum für Tumorerkrankungen (NCT) Heidelberg (Author)
  • Sven Nebelung - Universitätsklinikum Aachen (Author)
  • Daniel Truhn - Universitätsklinikum Aachen (Author)

Abstract

Purpose: To evaluate diagnostic accuracy of various large language models (LLMs) when answering radiology-specific questions with and without access to additional online, up-to-date information via retrieval-augmented generation (RAG).

Materials and Methods: The authors developed radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. RAG incorporates information retrieval from external sources to supplement the initial prompt, grounding the model's response in relevant information. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo [OpenAI], GPT-4, Mistral 7B, Mixtral 8×7B [Mistral], and Llama3-8B and -70B [Meta]) were prompted with and without RadioRAG in a zero-shot inference scenario (temperature ≤ 0.1, top-p = 1). RadioRAG retrieved context-specific information from www.radiopaedia.org. Accuracy of LLMs with and without RadioRAG in answering questions from each dataset was assessed. Statistical analyses were performed using bootstrapping while preserving pairing. Additional assessments included comparison of model with human performance and comparison of time required for conventional versus RadioRAG-powered question answering.

Results: RadioRAG improved accuracy for some LLMs, including GPT-3.5-turbo (74% [59 of 80] vs 66% [53 of 80], false discovery rate [FDR] = 0.03) and Mixtral 8×7B (76% [61 of 80] vs 65% [52 of 80], FDR = 0.02) on the RSNA radiology question answering (RSNA-RadioQA) dataset, with similar trends in the ExtendedQA dataset. Accuracy exceeded that of a human expert (63% [50 of 80], FDR ≤ 0.007) for these LLMs, although not for Mistral 7B-instruct-v0.2, Llama3-8B, and Llama3-70B (FDR ≥ 0.21). RadioRAG reduced hallucinations for all LLMs (rate, 6%-25%). RadioRAG increased estimated response time fourfold.

Conclusion: RadioRAG shows potential to improve LLM accuracy and factuality in radiology QA by integrating real-time, domain-specific data.

Keywords: Retrieval-augmented Generation, Informatics, Computer-aided Diagnosis, Large Language Models

Supplemental material is available for this article. © RSNA, 2025.
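
The abstract states that statistical analyses used bootstrapping while preserving pairing, but it gives no further detail. The following is a minimal sketch of what a paired bootstrap of an accuracy difference could look like, not the authors' implementation. The function name `paired_bootstrap`, the number of resamples, and the percentile interval are assumptions. The per-question outcomes are also not reported in the abstract, so the two vectors below are placeholder data constructed only to match the reported totals for GPT-3.5-turbo (59 of 80 with RadioRAG, 53 of 80 without); the overlap between them is invented for illustration.

```python
import random


def paired_bootstrap(with_rag, without_rag, n_boot=10_000, seed=0):
    """Bootstrap the accuracy difference between two conditions.

    Questions are resampled with replacement, and both outcomes of a
    sampled question are taken together, so the pairing is preserved.
    Returns (observed difference, lower bound, upper bound) using a
    95% percentile interval (an assumption, not stated in the abstract).
    """
    if len(with_rag) != len(without_rag):
        raise ValueError("both outcome lists must cover the same questions")
    n = len(with_rag)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = (sum(with_rag) - sum(without_rag)) / n

    diffs = []
    for _ in range(n_boot):
        # One resample: draw n question indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(with_rag[i] - without_rag[i] for i in idx) / n)

    diffs.sort()
    lo = diffs[n_boot * 25 // 1000]        # 2.5th percentile
    hi = diffs[n_boot * 975 // 1000 - 1]   # 97.5th percentile
    return observed, lo, hi


# PLACEHOLDER per-question outcomes (1 = correct, 0 = incorrect).
# Only the totals (59/80 and 53/80) come from the abstract; the split into
# "both correct", "only with RAG", "only without RAG", and "neither" is invented.
with_rag = [1] * 50 + [1] * 9 + [0] * 3 + [0] * 18
without_rag = [1] * 50 + [0] * 9 + [1] * 3 + [0] * 18

observed, lo, hi = paired_bootstrap(with_rag, without_rag)
print(f"accuracy difference: {observed:.3f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

Resampling question indices, rather than resampling each condition independently, keeps the correlation between a model's answers to the same question with and without retrieval. The resulting interval depends on the invented overlap above and should not be read as a reproduction of the reported FDR values.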

Details

Original language: English
Article number: e240476
Journal: Radiology: Artificial Intelligence
Volume: 7
Issue number: 4
Early online date: 18 June 2025
Publication status: Published - July 2025
Peer-review status: Yes

External IDs

unpaywall 10.1148/ryai.240476
Scopus 105013604101
ORCID /0000-0002-3730-5348/work/198594691

Keywords

  • Humans, Radiology/education, Information Storage and Retrieval/methods, Internet