Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4)

Daniel Truhn; Chiara M.L. Loeffler; Gustav Müller-Franzes; Sven Nebelung; Katherine J. Hewitt; Sebastian Brandner; Keno K. Bressem; Sebastian Foersch; Jakob Nikolas Kather

doi:10.1002/path.6232

Research output: Contribution to journal › Research article › Contributed › peer-review

Contributors

Daniel Truhn, RWTH Aachen University (Author)
Chiara M.L. Loeffler, Else Kröner Fresenius Center for Digital Health, Department of Internal Medicine I, University Hospital Aachen (Author)
Gustav Müller-Franzes, RWTH Aachen University (Author)
Sven Nebelung, RWTH Aachen University (Author)
Katherine J. Hewitt, Else Kröner Fresenius Center for Digital Health, University Hospital Aachen (Author)
Sebastian Brandner, Friedrich-Alexander University Erlangen-Nürnberg (Author)
Keno K. Bressem, Charité – Universitätsmedizin Berlin (Author)
Sebastian Foersch, Johannes Gutenberg University Mainz (Author)
Jakob Nikolas Kather, Else Kröner Fresenius Center for Digital Health, Department of Internal Medicine I, Heidelberg University, University of Leeds (Author)

Abstract

Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future.
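
The zero-shot workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the field names in `FIELDS`, the prompt wording, and the `ask_model` callable (a stand-in for whatever wrapper sends text to GPT-4 and returns its reply) are all assumptions made for illustration. The sketch builds an instruction-only prompt, parses a JSON reply, and computes per-field agreement with human annotations as a simple proxy for the concordance the abstract reports.

```python
import json
from typing import Callable, Optional

# Hypothetical target fields; the items actually extracted in the study
# are not listed in this record.
FIELDS = ["t_stage", "n_stage", "tumour_grade"]


def build_prompt(report_text: str) -> str:
    """Compose a zero-shot instruction (no examples, no re-training)."""
    keys = ", ".join(FIELDS)
    return (
        "Extract the following items from the pathology report below and "
        f"answer with a single JSON object with the keys: {keys}. "
        "Use null for items the report does not mention.\n\n"
        f"Report:\n{report_text}"
    )


def extract(report_text: str, ask_model: Callable[[str], str]) -> dict:
    """Send the prompt through `ask_model` and parse the JSON reply.

    `ask_model` is an assumed text-in, text-out wrapper around an LLM.
    """
    parsed = json.loads(ask_model(build_prompt(report_text)))
    # Keep only the expected keys; keys missing from the reply become None.
    return {field: parsed.get(field) for field in FIELDS}


def concordance(llm_rows: list[dict], human_rows: list[dict]) -> dict:
    """Per-field fraction of reports where LLM and human values match."""
    if len(llm_rows) != len(human_rows) or not human_rows:
        raise ValueError("need two non-empty lists of equal length")
    total = len(human_rows)
    result: dict[str, Optional[float]] = {}
    for field in FIELDS:
        matches = sum(
            llm[field] == human[field]
            for llm, human in zip(llm_rows, human_rows)
        )
        result[field] = matches / total
    return result
```

For a quick check without any model access, `ask_model` can be replaced by a function that returns a fixed JSON string; exact-match agreement is the simplest possible metric and a real evaluation would likely need value normalisation first.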

Details

Original language	English
Pages (from-to)	310-319
Number of pages	10
Journal	The Journal of Pathology
Volume	262 (2024)
Issue number	3
Publication status	Published - Mar 2024
Peer-reviewed	Yes

External IDs

Mendeley	e5fceb30-77fd-3047-a650-21d3ae19355b
ORCID	/0000-0002-3730-5348/work/198594484

Keywords

Sustainable Development Goals

SDG 3 - Good Health and Well-being

ASJC Scopus subject areas

Pathology and Forensic Medicine

Keywords

artificial intelligence, large language models, named entity recognition, natural language processing, pathology report, text mining
