A clinical environment simulator for dynamic AI evaluation

Research output: Contribution to journal › Review article › Contributed › peer-review

Contributors

  • Luyang Luo - Harvard University (Author)
  • Sung Eun Kim - Harvard University, Seoul National University (Author)
  • Xiaoman Zhang - Harvard University (Author)
  • Julius M. Kernbach - Heidelberg University (Author)
  • Roshan Kenia - Harvard University (Author)
  • Julian N. Acosta - Harvard University (Author)
  • Larry A. Nathanson - Beth Israel Deaconess Medical Center (BIDMC) (Author)
  • Adrian D. Haimovich - Beth Israel Deaconess Medical Center (BIDMC) (Author)
  • Adam Rodman - Beth Israel Deaconess Medical Center (BIDMC) (Author)
  • Ethan Goh - Stanford University (Author)
  • Jonathan H. Chen - Stanford University (Author)
  • Nigam H. Shah - Stanford University (Author)
  • David A. Kim - Stanford University (Author)
  • James Zou - Stanford University (Author)
  • Faisal Mahmood - Partners HealthCare, Harvard University, Broad Institute of Harvard University and MIT (Author)
  • Jakob Nikolas Kather - Else Kröner Fresenius Center for Digital Health, National Center for Tumor Diseases (NCT) Heidelberg (Author)
  • Matthew Lungren - Microsoft Research (Author)
  • Vivek Natarajan - Alphabet Inc. (Author)
  • Eric J. Topol - Scripps Research Translational Institute (Author)
  • Pranav Rajpurkar - Harvard University (Author)

Abstract

Clinical evaluation of large language models (LLMs) currently relies on static datasets and isolated scenarios that fail to capture the cascading effects of healthcare decisions. We propose the Clinical Environment Simulator (CES), a framework that evaluates clinical LLMs within digital hospital environments where every decision dynamically alters future states. The CES would use a parallel simulation architecture: a ‘hospital engine’ that tracks bed availability, staff workloads and equipment status in real time, and a ‘patient engine’ that simulates disease progression and treatment responses based on LLM interventions. Unlike current benchmarks, the CES framework requires clinical LLMs to execute decisions through realistic electronic health record interfaces, while managing trade-offs between individual patient optimization and system-wide efficiency. The CES enables three critical evaluations absent from current benchmarks: temporal reasoning under evolving constraints, where delayed diagnostics can lead to patient deterioration; resource-aware decision-making, where aggressive workups for one patient may exhaust capacity needed by others; and operational resilience, through adversarial testing with simultaneous emergencies and system failures. By scoring LLM performance on both clinical outcomes and operational metrics, the CES represents a shift toward evaluating clinical LLMs as a dynamic and integrated component of healthcare delivery systems.
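
The two-engine idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the single bed resource, the integer severity score and the `step` and `score` functions are all hypothetical, chosen only to show how one decision can change both the hospital state and the later state of other patients.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical patient state; 'severity' is an illustrative scalar."""
    name: str
    severity: int            # higher means sicker (assumed scale)
    admitted: bool = False


@dataclass
class Hospital:
    """Hypothetical hospital state tracking one shared resource."""
    beds_free: int


def step(hospital: Hospital, patients: list[Patient], admit: set[str]) -> None:
    """Advance one tick: apply admission orders, then update patient states."""
    # Toy 'hospital engine': an admission succeeds only if a bed is free.
    for p in patients:
        if p.name in admit and not p.admitted and hospital.beds_free > 0:
            hospital.beds_free -= 1
            p.admitted = True
    # Toy 'patient engine': admitted patients improve, waiting ones worsen.
    for p in patients:
        if p.admitted:
            p.severity = max(0, p.severity - 1)
        else:
            p.severity += 1


def score(hospital: Hospital, patients: list[Patient]) -> dict[str, int]:
    """Report one clinical and one operational number (both assumed metrics)."""
    return {
        "total_severity": sum(p.severity for p in patients),
        "beds_free": hospital.beds_free,
    }


if __name__ == "__main__":
    hospital = Hospital(beds_free=1)
    patients = [Patient("A", severity=3), Patient("B", severity=5)]
    # Both admissions are ordered, but only one bed exists, so B keeps waiting.
    step(hospital, patients, admit={"A", "B"})
    print(score(hospital, patients))
```

In this sketch, admitting patient A first uses the only bed, so patient B deteriorates during the same tick. That is a small-scale version of the resource-aware trade-off the abstract describes.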

Details

Original language: English
Pages (from-to): 820-827
Number of pages: 8
Journal: Nature Medicine
Volume: 32
Issue number: 3
Publication status: Published - Mar 2026
Peer-reviewed: Yes

External IDs

PubMed: 41820673
ORCID: /0000-0002-3730-5348/work/212492323