Benchmarking vision-language models for diagnostics in emergency and critical care settings

Research output: Contribution to journal › Research article › peer-review

Contributors

  • Christoph F. Kurz, Novartis Pharma AG (Author)
  • Tatiana Merzhevich, Friedrich-Alexander University Erlangen-Nürnberg (Author)
  • Bjoern M. Eskofier, Friedrich-Alexander University Erlangen-Nürnberg; Helmholtz Zentrum München - German Research Center for Environmental Health (Author)
  • Jakob Nikolas Kather, Department of Internal Medicine I, Else Kröner Fresenius Center for Digital Health, National Center for Tumor Diseases (NCT) Heidelberg (Author)
  • Benjamin Gmeiner, Novartis Pharma AG (Author)

Abstract

The applicability of vision-language models (VLMs) to acute care in emergency and intensive care units remains underexplored. Using a multimodal dataset of diagnostic questions combining medical images with clinical context, we benchmarked several small open-source VLMs against GPT-4o. The open models demonstrated limited diagnostic accuracy (up to 40.4%), and GPT-4o significantly outperformed them (68.1%). These findings highlight the need for specialized training and optimization to make open-source VLMs viable for acute care applications.

Details

Original language: English
Article number: 423
Journal: npj Digital Medicine
Volume: 8
Issue number: 1
Publication status: Published - Dec 2025
Peer-reviewed: Yes

External IDs

ORCID /0000-0002-3730-5348/work/198594679