Unconditional latent diffusion models memorize patient imaging data

Research output: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

  • Salman Ul Hassan Dar - Heidelberg University, Health + Life Science Alliance Heidelberg Mannheim, Deutsches Zentrum für Herz-Kreislaufforschung (DZHK), University Hospital Heidelberg (Author)
  • Marvin Seyfarth - Heidelberg University, Deutsches Zentrum für Herz-Kreislaufforschung (DZHK), University Hospital Heidelberg (Author)
  • Isabelle Ayx - Universitätsmedizin Mannheim (Author)
  • Theano Papavassiliu - Health + Life Science Alliance Heidelberg Mannheim, Deutsches Zentrum für Herz-Kreislaufforschung (DZHK), Heidelberg University (Author)
  • Stefan O. Schoenberg - Health + Life Science Alliance Heidelberg Mannheim, Universitätsmedizin Mannheim (Author)
  • Robert Malte Siepmann - University Hospital Aachen (Author)
  • Fabian Christopher Laqua - University Hospital of Würzburg (Author)
  • Jannik Kahmann - Universitätsmedizin Mannheim (Author)
  • Norbert Frey - Heidelberg University, Deutsches Zentrum für Herz-Kreislaufforschung (DZHK), University Hospital Heidelberg (Author)
  • Bettina Baeßler - University Hospital of Würzburg (Author)
  • Sebastian Foersch - University Medical Center Mainz (Author)
  • Daniel Truhn - University Hospital Aachen (Author)
  • Jakob Nikolas Kather - Department of Internal Medicine I, Else Kröner Fresenius Center for Digital Health, National Center for Tumor Diseases (NCT) Heidelberg (Author)
  • Sandy Engelhardt - Heidelberg University, Health + Life Science Alliance Heidelberg Mannheim, Deutsches Zentrum für Herz-Kreislaufforschung (DZHK), University Hospital Heidelberg (Author)

Abstract

Generative artificial intelligence models facilitate open-data sharing by proposing synthetic data as surrogates of real patient data. Despite their promise for healthcare, some of these models are susceptible to patient data memorization, where they generate patient data copies instead of novel synthetic samples, which can result in patient re-identification. Here we assess memorization in unconditional latent diffusion models by training them on a variety of datasets for synthetic data generation and detecting memorization with a self-supervised copy detection approach. We show a high degree of patient data memorization across all datasets, with approximately 37.2% of patient data detected as memorized and 68.7% of synthetic samples identified as patient data copies. Latent diffusion models are more susceptible to memorization than autoencoders and generative adversarial networks, yet they outperform these non-diffusion models in synthesis quality. Augmentation strategies during training, smaller architectures and larger training datasets can reduce memorization, whereas overtraining the models can enhance it. These results emphasize the importance of carefully training generative models on private medical imaging datasets and of examining the synthetic data to ensure patient privacy.
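
To make the detection step concrete, the sketch below illustrates one way a self-supervised copy detection approach of the kind described in the abstract can be implemented: synthetic and training images are mapped into an embedding space by a frozen self-supervised encoder, and a synthetic sample is flagged as a likely patient data copy when its nearest training image exceeds a cosine-similarity threshold. The encoder, the threshold value and the function names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of embedding-based copy detection between synthetic and
# training images. `encoder` is assumed to be any frozen self-supervised
# feature extractor (e.g. a contrastive-style network); names and the
# similarity threshold are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(encoder, images, batch_size=64, device="cuda"):
    """Map a stack of images (N, C, H, W) to L2-normalized feature vectors."""
    feats = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size].to(device)
        feats.append(F.normalize(encoder(batch), dim=1).cpu())
    return torch.cat(feats)

@torch.no_grad()
def flag_copies(encoder, synthetic, training, threshold=0.95):
    """Flag synthetic samples whose nearest training image exceeds a
    cosine-similarity threshold, i.e. candidate memorized copies."""
    z_syn = embed(encoder, synthetic)   # (N_syn, D)
    z_trn = embed(encoder, training)    # (N_trn, D)
    sim = z_syn @ z_trn.T               # cosine similarities (unit vectors)
    nn_sim, nn_idx = sim.max(dim=1)     # closest training image per sample
    copied = nn_sim >= threshold
    return copied, nn_idx, nn_sim
```

In practice the threshold would need calibration, for example so that augmented versions of training images are still recognized as copies of their originals; a calibrated detector of this kind is what allows statements such as "68.7% of synthetic samples identified as patient data copies".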

Details

Original language: English
Journal: Nature Biomedical Engineering
Volume: 2025
Publication status: E-pub ahead of print, 11 Aug 2025
Peer-reviewed: Yes

External IDs

ORCID: 0000-0002-3730-5348/work/198594702