GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details

Research output: Contribution to book/Conference proceedings/Anthology/Report > Conference contribution > Contributed > peer-review

Contributors

  • Christina Lohr - Medical Informatics Initiative in Germany, Leipzig University (Author)
  • Jakob Faller - Medical Informatics Initiative in Germany, University Hospital at the Friedrich-Alexander University Erlangen-Nürnberg (Author)
  • Andrea Riedel - Medical Informatics Initiative in Germany, University Hospital at the Friedrich-Alexander University Erlangen-Nürnberg, Friedrich-Alexander University Erlangen-Nürnberg (Author)
  • Hung Manh Nguyen - Institute for Medical Informatics and Biometry, Medical Informatics Initiative in Germany (Author)
  • Markus Wolfien - Institute for Medical Informatics and Biometry, Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden), Medical Informatics Initiative in Germany (Author)
  • Justin Hofenbitzer - Medical Informatics Initiative in Germany, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Luise Modersohn - Medical Informatics Initiative in Germany, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Jutta Romberg - Medical Informatics Initiative in Germany, Berlin Institute of Health at Charité (Author)
  • Fabian Prasser - Medical Informatics Initiative in Germany, Berlin Institute of Health at Charité (Author)
  • Jazia Omeirat - Medical Informatics Initiative in Germany, University Hospital Essen (Author)
  • Yutong Wen - Medical Informatics Initiative in Germany, University Hospital Essen (Author)
  • Oksana Galusch - Medical Informatics Initiative in Germany, University Hospital Leipzig (Author)
  • Udo Hahn - Medical Informatics Initiative in Germany, Leipzig University (Author)
  • Marvin Seiferling - Medical Informatics Initiative in Germany, University Hospital Heidelberg (Author)
  • Christoph Dieterich - Medical Informatics Initiative in Germany, University Hospital Heidelberg (Author)
  • Peter Klügl - Medical Informatics Initiative in Germany, Averbis GmbH (Author)
  • Franz Matthies - Medical Informatics Initiative in Germany, Leipzig University (Author)
  • Janina Kind - Medical Informatics Initiative in Germany, University Hospital Leipzig (Author)
  • Martin Boeker - Medical Informatics Initiative in Germany, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Markus Löffler - Leipzig University, University Hospital Leipzig, Medical Informatics Initiative in Germany (Author)
  • Frank Meineke - Medical Informatics Initiative in Germany, Leipzig University (Author)

Abstract

INTRODUCTION: In 2024, the GeMTeX project launched the largest ever de-identification campaign for German-language clinical reports, and, as a pilot study, published GraSCCoPHI, the first de-identified German-language gold standard corpus of synthetic discharge summaries.

METHODS: GeMTeX's de-identification workflow is described here, including annotation tool management and pre-annotation experience, such as assembling and training annotation groups and evolving the guidelines.

RESULTS: We present the project's progress in the first year with respect to de-identification efforts and the challenges we faced during the rollout at six hospital sites in four German states. The refinement of the annotation guidelines became an ongoing process, often with unforeseen hurdles to overcome as we moved from testing to production. From our current internal interim corpus (9,000 documents with about 20 million tokens), we are publishing the first quantitative insights, such as the average amount of identifiable information per document, a list of confounding factors we did not anticipate at the beginning of the project, and three key lessons learned.

CONCLUSION: We note that the unforeseen hurdles follow the Pareto principle: they are concentrated in less than 20% of the annotations.

Details

Original language: English
Title of host publication: German Medical Data Sciences 2025: GMDS Illuminates Health
Editors: Rainer Rohrig, Thomas Ganslandt, Klaus Jung, Ann-Kristin Kock-Schoppenhauer, Jochem Konig, Ulrich Sax, Martin Sedlmayr, Cord Spreckelsen, Antonia Zapf
Pages: 274-282
Number of pages: 9
ISBN (electronic): 978-1-64368-615-8
Publication status: Published - 3 Sept 2025
Peer-reviewed: Yes

Publication series

Series: Studies in health technology and informatics
Volume: 331
ISSN: 0926-9630

External IDs

Scopus 105015749930
ORCID /0000-0002-1887-4772/work/196688955

Keywords

  • Confidentiality, Data Anonymization, Electronic Health Records/organization & administration, Germany, Humans, Natural Language Processing, Patient Discharge Summaries/standards, Pilot Projects, Privacy, De-Identification