GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details

Publication: Contribution to book/conference proceedings/anthology/report, Contribution to conference proceedings, Contributed, Peer-reviewed

Contributors

  • Christina Lohr, Medizininformatik-Initiativen, Universität Leipzig (Author)
  • Jakob Faller, Medizininformatik-Initiativen, Universitätsklinikum der Friedrich-Alexander-Universität Erlangen-Nürnberg (Author)
  • Andrea Riedel, Medizininformatik-Initiativen, Universitätsklinikum der Friedrich-Alexander-Universität Erlangen-Nürnberg, Friedrich-Alexander-Universität Erlangen-Nürnberg (Author)
  • Hung Manh Nguyen, Institut für Medizinische Informatik und Biometrie, Medizininformatik-Initiativen (Author)
  • Markus Wolfien, Institut für Medizinische Informatik und Biometrie, Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI Dresden), Medizininformatik-Initiativen (Author)
  • Justin Hofenbitzer, Medizininformatik-Initiativen, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Luise Modersohn, Medizininformatik-Initiativen, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Jutta Romberg, Medizininformatik-Initiativen, Berliner Institut für Gesundheitsforschung in der Charité (Author)
  • Fabian Prasser, Medizininformatik-Initiativen, Berliner Institut für Gesundheitsforschung in der Charité (Author)
  • Jazia Omeirat, Medizininformatik-Initiativen, Universitätsklinikum Essen (Author)
  • Yutong Wen, Medizininformatik-Initiativen, Universitätsklinikum Essen (Author)
  • Oksana Galusch, Medizininformatik-Initiativen, Universitätsklinikum Leipzig (Author)
  • Udo Hahn, Medizininformatik-Initiativen, Universität Leipzig (Author)
  • Marvin Seiferling, Medizininformatik-Initiativen, Universitätsklinikum Heidelberg (Author)
  • Christoph Dieterich, Medizininformatik-Initiativen, Universitätsklinikum Heidelberg (Author)
  • Peter Klügl, Medizininformatik-Initiativen, Averbis GmbH (Author)
  • Franz Matthies, Medizininformatik-Initiativen, Universität Leipzig (Author)
  • Janina Kind, Medizininformatik-Initiativen, Universitätsklinikum Leipzig (Author)
  • Martin Boeker, Medizininformatik-Initiativen, Klinikum Rechts der Isar (MRI TUM) (Author)
  • Markus Löffler, Universität Leipzig, Universitätsklinikum Leipzig, Medizininformatik-Initiativen (Author)
  • Frank Meineke, Medizininformatik-Initiativen, Universität Leipzig (Author)

Abstract

INTRODUCTION: In 2024, the GeMTeX project launched the largest ever de-identification campaign for German-language clinical reports, and, as a pilot study, published GraSCCoPHI, the first de-identified German-language gold standard corpus of synthetic discharge summaries.

METHODS: GeMTeX's de-identification workflow is described here, including annotation tool management and pre-annotation experience, such as assembling and training annotation groups and the evolution of guidelines.

RESULTS: We present the project's progress in the first year with respect to de-identification efforts and the challenges we faced during the rollout at six hospital sites in four German states. The refinement of the annotation guidelines became an ongoing process, often with unforeseen hurdles to overcome as we moved from testing to production. From our current internal interim corpus (9,000 documents with about 20 million tokens), we are publishing the first quantitative insights, such as the average amount of identifiable information per document, a list of confounding factors we did not anticipate at the beginning of the project, and three key lessons learned.

CONCLUSION: We note that the unforeseen hurdles follow a Pareto-like pattern: they are confined to less than 20% of the annotations.

Details

Original language: English
Title: German Medical Data Sciences 2025: GMDS Illuminates Health
Editors: Rainer Röhrig, Thomas Ganslandt, Klaus Jung, Ann-Kristin Kock-Schoppenhauer, Jochem König, Ulrich Sax, Martin Sedlmayr, Cord Spreckelsen, Antonia Zapf
Pages: 274-282
Number of pages: 9
ISBN (electronic): 978-1-64368-615-8
Publication status: Published - 3 Sept. 2025
Peer-reviewed: Yes

Publication series

Series: Studies in Health Technology and Informatics
Volume: 331
ISSN: 0926-9630

External IDs

Scopus 105015749930
ORCID /0000-0002-1887-4772/work/196688955

Keywords

  • Confidentiality, Data Anonymization, Electronic Health Records/organization & administration, Germany, Humans, Natural Language Processing, Patient Discharge Summaries/standards, Pilot Projects, Privacy, De-Identification