A clinical environment simulator for dynamic AI evaluation

Luyang Luo; Sung Eun Kim; Xiaoman Zhang; Julius M. Kernbach; Roshan Kenia; Julian N. Acosta; Larry A. Nathanson; Adrian D. Haimovich; Adam Rodman; Ethan Goh; Jonathan H. Chen; Nigam H. Shah; David A. Kim; James Zou; Faisal Mahmood; Jakob Nikolas Kather; Matthew Lungren; Vivek Natarajan; Eric J. Topol; Pranav Rajpurkar

doi:10.1038/s41591-026-04252-6

Research output: Contribution to journal › Review article › Contributed › peer-review

Contributors

Luyang Luo - Harvard University (Author)
Sung Eun Kim - Harvard University, Seoul National University (Author)
Xiaoman Zhang - Harvard University (Author)
Julius M. Kernbach - Heidelberg University (Author)
Roshan Kenia - Harvard University (Author)
Julian N. Acosta - Harvard University (Author)
Larry A. Nathanson - Beth Israel Deaconess Medical Center (BIDMC) (Author)
Adrian D. Haimovich - Beth Israel Deaconess Medical Center (BIDMC) (Author)
Adam Rodman - Beth Israel Deaconess Medical Center (BIDMC) (Author)
Ethan Goh - Stanford University (Author)
Jonathan H. Chen - Stanford University (Author)
Nigam H. Shah - Stanford University (Author)
David A. Kim - Stanford University (Author)
James Zou - Stanford University (Author)
Faisal Mahmood - Partners HealthCare, Harvard University, Broad Institute of Harvard University and MIT (Author)
Jakob Nikolas Kather - Else Kröner Fresenius Center for Digital Health, National Center for Tumor Diseases (NCT) Heidelberg (Author)
Matthew Lungren - Microsoft Research (Author)
Vivek Natarajan - Alphabet Inc. (Author)
Eric J. Topol - Scripps Research Translational Institute (Author)
Pranav Rajpurkar - Harvard University (Author)

Abstract

Clinical evaluation of large language models (LLMs) currently relies on static datasets and isolated scenarios that fail to capture the cascading effects of healthcare decisions. We propose the Clinical Environment Simulator (CES), a framework that evaluates clinical LLMs within digital hospital environments where every decision dynamically alters future states. The CES would use a parallel simulation architecture: a ‘hospital engine’ that tracks bed availability, staff workloads and equipment status in real time, and a ‘patient engine’ that simulates disease progression and treatment responses based on LLM interventions. Unlike current benchmarks, the CES framework requires clinical LLMs to execute decisions through realistic electronic health record interfaces, while managing trade-offs between individual patient optimization and system-wide efficiency. The CES enables three critical evaluations absent from current benchmarks: temporal reasoning under evolving constraints, where delayed diagnostics can lead to patient deterioration; resource-aware decision-making, where aggressive workups for one patient may exhaust capacity needed by others; and operational resilience, through adversarial testing with simultaneous emergencies and system failures. By scoring LLM performance on both clinical outcomes and operational metrics, the CES represents a shift toward evaluating clinical LLMs as a dynamic and integrated component of healthcare delivery systems.
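
As a rough illustration of the parallel architecture the abstract describes, the sketch below couples a toy hospital state (bed availability only) to a toy patient state (a severity score that worsens until treatment begins) and scores one clinical and one operational metric. It is not the authors' implementation: the class names, the 0-10 severity scale, the progression rule and the `sickest_first` policy standing in for an LLM are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical patient state: severity rises each tick until treated."""
    name: str
    severity: int          # 0 = recovered, 10 = critical (illustrative scale)
    treated: bool = False

    def step(self) -> None:
        # Treated patients improve; untreated patients deteriorate.
        if self.treated:
            self.severity = max(0, self.severity - 2)
        else:
            self.severity = min(10, self.severity + 1)


@dataclass
class Hospital:
    """Hypothetical hospital state: only bed availability is tracked here."""
    beds_free: int

    def admit(self, patient: Patient) -> bool:
        # An admission succeeds only if a bed is free and the patient is waiting.
        if self.beds_free == 0 or patient.treated:
            return False
        self.beds_free -= 1
        patient.treated = True
        return True


def sickest_first(patients, hospital):
    """Stand-in for the LLM policy: pick the sickest untreated patient."""
    waiting = [p for p in patients if not p.treated]
    return max(waiting, key=lambda p: p.severity) if waiting else None


def run(patients, hospital, policy, ticks):
    """Advance both states in lockstep, with one decision per tick."""
    for _ in range(ticks):
        choice = policy(patients, hospital)
        if choice is not None:
            hospital.admit(choice)
        for p in patients:
            p.step()
    return {
        "total_severity": sum(p.severity for p in patients),  # clinical outcome
        "beds_free": hospital.beds_free,                       # operational metric
    }
```

Calling `run([Patient("A", 5), Patient("B", 3), Patient("C", 7)], Hospital(beds_free=2), sickest_first, ticks=3)` leaves one patient without a bed, so that patient keeps deteriorating while the other two recover. This is a small-scale version of the resource trade-off the abstract highlights.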

Details

Original language	English
Pages (from-to)	820-827
Number of pages	8
Journal	Nature Medicine
Volume	32
Issue number	3
Publication status	Published - Mar 2026
Peer-reviewed	Yes

External IDs

PubMed	41820673
ORCID	/0000-0002-3730-5348/work/212492323
