Statistical biases due to anonymization evaluated in an open clinical dataset from COVID-19 patients

Research output: Contribution to journal › Research article › Contributed › peer-review

Contributors

  • NAPKON Study Group - (Author)
  • Uniklinik Köln
  • Charité – Universitätsmedizin Berlin
  • University Hospital of Würzburg
  • University Hospital Bielefeld
  • University Hospital Schleswig-Holstein Campus Kiel
  • Hospital of the Ludwig-Maximilians-University (LMU) Munich
  • Klinikum Rechts der Isar (MRI TUM)
  • University Hospital Frankfurt
  • University Hospital Regensburg
  • Justus Liebig University Giessen
  • Institute for Prevention and Occupational Medicine of the German Social Accident Insurance (IPA)

Abstract

Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data to reduce privacy risks. However, the degree of protection must be balanced against the impact on statistical properties. We studied an extreme case of this trade-off: the statistical validity of an open medical dataset based on the German National Pandemic Cohort Network (NAPKON), which was prepared for publication using a strong anonymization procedure. Descriptive statistics and results of regression analyses were compared before and after anonymization of multiple variants of the original dataset. Despite significant differences in value distributions, the statistical bias was found to be small in all cases. In the regression analyses, the median absolute deviations of the estimated adjusted odds ratios for different sample sizes ranged from 0.01 [minimum = 0, maximum = 0.58] to 0.52 [minimum = 0.25, maximum = 0.91]. Disproportionate impact on the statistical properties of data is a common argument against the use of anonymization. Our analysis demonstrates that anonymization can actually preserve validity of statistical results in relatively low-dimensional data.
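The deviation figures quoted in the abstract can be illustrated with a minimal sketch. It assumes that "median absolute deviation" here means the median of the absolute differences between each adjusted odds ratio estimated on the original data and its counterpart estimated on the anonymized data; the abstract does not spell out the exact definition. The function name `summarize_or_deviation` and all numbers below are hypothetical placeholders, not values from the study.

```python
from statistics import median


def summarize_or_deviation(original, anonymized):
    """Return (median, minimum, maximum) of |original - anonymized|.

    Both arguments are sequences of adjusted odds ratios, one entry per
    regression coefficient, in the same order. This is an assumed reading
    of the metric, not the study's confirmed procedure.
    """
    if len(original) != len(anonymized):
        raise ValueError("both inputs must have the same length")
    if not original:
        raise ValueError("inputs must not be empty")
    diffs = [abs(o - a) for o, a in zip(original, anonymized)]
    return median(diffs), min(diffs), max(diffs)


# Made-up odds ratios for illustration only (NOT taken from NAPKON).
or_original = [1.20, 0.85, 2.10, 1.50]
or_anonymized = [1.25, 0.80, 2.40, 1.50]

med, lowest, highest = summarize_or_deviation(or_original, or_anonymized)
print(f"{med:.2f} [minimum = {lowest:.2f}, maximum = {highest:.2f}]")
```

Reporting the minimum and maximum alongside the median, as the abstract does, shows whether a small typical deviation hides a few strongly affected coefficients.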

Details

Original language: English
Article number: 776
Pages (from-to): 1-15
Number of pages: 15
Journal: Scientific Data
Volume: 9
Issue number: 1
Publication status: Published - 21 Dec 2022
Peer-reviewed: Yes
Externally published: Yes

External IDs

PubMed Central: PMC9769467
Scopus: 85144597072

Keywords

  • Humans; Bias; COVID-19; Data Anonymization; Models, Theoretical; Privacy; Data Interpretation, Statistical; Datasets as Topic

Library keywords