LongHealth: A Question Answering Benchmark with Long Clinical Documents

Publication: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

  • Lisa Adams, Technische Universität München (Author)
  • Felix Busch, Technische Universität München, Charité – Universitätsmedizin Berlin (Author)
  • Tianyu Han, Universitätsklinikum Aachen (Author)
  • Jean Baptiste Excoffier, Kaduceo (Author)
  • Matthieu Ortala, Kaduceo (Author)
  • Alexander Löser, Berliner Hochschule für Technik (Author)
  • Hugo J.W.L. Aerts, Maastricht University, Harvard Medical School (HMS) (Author)
  • Jakob Nikolas Kather, Medizinische Klinik und Poliklinik I, Else Kröner Fresenius Zentrum für Digitale Gesundheit, Nationales Zentrum für Tumorerkrankungen (NCT) Heidelberg (Author)
  • Daniel Truhn, Universitätsklinikum Aachen (Author)
  • Keno Bressem, Technische Universität München (Author)

Abstract

Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability to handle real-world, lengthy clinical data. We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, each containing 5,090 to 6,754 words. The benchmark challenges LLMs with 400 multiple-choice questions in three categories (information extraction, negation, and sorting), requiring models to extract and interpret information from large clinical documents. We evaluated eleven open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient Generative Pre-trained Transformer 3.5 Turbo (GPT-3.5 Turbo) for comparison. The highest accuracy was observed for Mistral-Small-24B-Instruct-2501 and Llama-4-Scout-17B-16E-Instruct, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation. In conclusion, while LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The LongHealth benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application. We make the benchmark and evaluation code publicly available.
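
The published evaluation code is not reproduced on this page; as a rough illustration of how a benchmark with this structure is scored, the following is a minimal sketch. It assumes a JSON layout with per-patient cases carrying `text` and `questions` fields, question objects with `question`, `options`, `category`, and `answer` keys (all field names are hypothetical), and a `query_model` callback standing in for any LLM API.

```python
import json
from collections import defaultdict

def evaluate(benchmark_path, query_model):
    """Score a model on LongHealth-style multiple-choice questions.

    query_model(document, question, options) stands in for any LLM call
    and is expected to return the text of the chosen answer option.
    """
    with open(benchmark_path, encoding="utf-8") as f:
        cases = json.load(f)  # hypothetical layout: a list of patient cases

    correct, total = defaultdict(int), defaultdict(int)
    for case in cases:
        document = case["text"]        # long clinical record, ~5,090-6,754 words
        for q in case["questions"]:
            category = q["category"]   # information extraction, negation, or sorting
            prediction = query_model(document, q["question"], q["options"])
            total[category] += 1
            correct[category] += int(prediction == q["answer"])

    # Per-category accuracy, mirroring the paper's reporting by task type
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per category rather than as a single score reflects the finding above: models that retrieve facts well can still fail on negation questions, i.e. on identifying missing information.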

Details

Original language: English
Pages (from-to): 280–296
Number of pages: 17
Journal: Journal of Healthcare Informatics Research
Volume: 9
Issue number: 3
Early online date: 14 June 2025
Publication status: Published - September 2025
Peer-review status: Yes

External IDs

ORCID /0000-0002-3730-5348/work/198594676

Keywords

  • Benchmark, Discharge notes, Electronic health records, Healthcare, Large language models, Medical question answering