High-dimensional, outcome-dependent missing data problems: Models for the human KIR loci
Publication: Contribution to journal › Research article › Contributed › Peer-reviewed
Contributors
Abstract
Missing data problems are common in high-dimensional biological data, where values can be partially or completely missing. Methods have been developed to reconstruct the missing values by means of imputation or expectation-maximization algorithms. For missing data problems, it has been suggested that the regression model of interest should be incorporated into the imputation procedure to reduce bias of the regression coefficients. Here we consider a challenging missing data problem, where diplotypes of the KIR loci are to be reconstructed. These loci are difficult to genotype, resulting in ambiguous genotype calls. We extend a previously proposed expectation-maximization algorithm by incorporating a potentially high-dimensional regression model to model the outcome. Three strategies are evaluated: (1) only allelic predictors, (2) allelic predictors and forward-backward selection on haplotype predictors, and (3) penalized regression on a saturated model. In a simulation study, we compared these strategies with a baseline expectation-maximization algorithm without an outcome model. For extreme choices of effect sizes and missingness levels, the outcome-based expectation-maximization algorithms outperformed the no-outcome expectation-maximization algorithm. However, in all other cases, the no-outcome expectation-maximization algorithm performed better than or comparably to the three strategies, suggesting the outcome model can have a harmful effect. In a data analysis concerning death after allogeneic hematopoietic stem cell transplantation as a function of donor KIR genes, expectation-maximization algorithms with and without outcome showed very similar results. In conclusion, outcome-based missing data models in the high-dimensional setting have to be used with care and are likely to lead to biased results.
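To make the idea of a no-outcome expectation-maximization baseline concrete, the following is a minimal sketch of a textbook EM for haplotype frequencies from ambiguous genotype calls. It is not the algorithm from the article: the function name `em_haplotype_frequencies`, the input layout (one list of compatible unordered diplotypes per individual), and the Hardy-Weinberg assumption are all illustrative assumptions.

```python
from collections import defaultdict


def em_haplotype_frequencies(candidates, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from ambiguous diplotype calls.

    candidates: one list per individual, each holding the unordered
    diplotypes (h1, h2) compatible with that individual's genotype call.
    Hypothetical layout, assumed for illustration only.
    """
    haplotypes = sorted({h for cands in candidates for d in cands for h in d})
    # Start from uniform frequencies so every candidate has positive weight.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}

    for _ in range(n_iter):
        counts = defaultdict(float)
        for cands in candidates:
            # E-step: posterior weight of each compatible diplotype under
            # Hardy-Weinberg proportions (factor 2 for heterozygotes).
            weights = [
                freq[a] * freq[b] * (1.0 if a == b else 2.0) for a, b in cands
            ]
            total = sum(weights)
            for (a, b), w in zip(cands, weights):
                counts[a] += w / total
                counts[b] += w / total

        # M-step: expected haplotype counts divided by number of chromosomes.
        n_chrom = 2.0 * len(candidates)
        new_freq = {h: counts[h] / n_chrom for h in haplotypes}
        delta = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if delta < tol:
            break
    return freq


if __name__ == "__main__":
    # Third individual has an ambiguous call with two compatible diplotypes.
    calls = [[("A", "A")], [("A", "B")], [("A", "B"), ("B", "B")]]
    print(em_haplotype_frequencies(calls))
```

In an outcome-based variant, the E-step weights would additionally be multiplied by the likelihood of each individual's outcome given the candidate diplotype, and the regression model would be refitted in the M-step; that part is omitted here.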
Details
| Original language | English |
|---|---|
| Pages (from - to) | 440-456 |
| Number of pages | 17 |
| Journal | Statistical Methods in Medical Research |
| Volume | 34 |
| Issue number | 3 |
| Publication status | Published - March 2025 |
| Peer review status | Yes |
External IDs
| PubMed | 39885761 |
|---|---|
Keywords
- expectation-maximization algorithm, haplotype reconstruction, KIR genes, missing data, multiple imputation, outcome-dependent imputation