Haplotype reconstruction for genetically complex regions with ambiguous genotype calls: Illustration by the KIR gene region

Publication: Contribution to journal, Research article, Contributed, Peer-reviewed

Contributors

  • Lars L J van der Burg, Biomedical Data Sciences (Author)
  • Liesbeth C de Wreede, Biomedical Data Sciences (Author)
  • Henning Baldauf, DKMS Clinical Trials Unit gGmbH (Author)
  • Jürgen Sauter, DKMS Clinical Trials Unit gGmbH (Author)
  • Johannes Schetelig, Medizinische Klinik und Poliklinik I, DKMS Clinical Trials Unit gGmbH (Author)
  • Hein Putter, Biomedical Data Sciences (Author)
  • Stefan Böhringer, Biomedical Data Sciences (Author)

Abstract

Advances in DNA sequencing technologies have enabled genotyping of complex genetic regions exhibiting copy number variation and high allelic diversity, yet exact genotypes cannot be derived in all cases, often resulting in ambiguous genotype calls, that is, partially missing data. One example of such a region is the killer-cell immunoglobulin-like receptor (KIR) gene region. These genes are of special interest in the context of allogeneic hematopoietic stem cell transplantation. For such complex gene regions, current haplotype reconstruction methods are not feasible as they cannot cope with the complexity of the data. We present an expectation-maximization (EM) algorithm to estimate haplotype frequencies (HTFs) which deals with the missing data components and takes into account linkage disequilibrium (LD) between genes. To cope with the exponential increase in the number of haplotypes as genes are added, we add three components to a standard EM-algorithm implementation. First, reconstruction is performed iteratively, adding one gene at a time. Second, after each step, haplotypes with frequencies below a threshold are collapsed into a rare haplotype group. Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates. A simulation study evaluates the effect of combining information from multiple genes on the estimates of these frequencies. We show that estimated HTFs are approximately unbiased. Our simulation study shows that the EM algorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias. Linear regression models based on this EM algorithm show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse. In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study.
Our new EM-algorithm-based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region. This algorithm can handle the numerous observed ambiguities, and allows for the collapsing of haplotypes to perform implicit dimension reduction. Combining information from multiple genes improves haplotype reconstruction.
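
As a rough illustration of the ideas summarized in the abstract, the following is a minimal sketch of a textbook EM estimate of haplotype frequencies from ambiguous genotype calls, followed by collapsing of rare haplotypes. It is not the authors' implementation: the function names, the toy data, the threshold value, and the Hardy-Weinberg assumption used to weight haplotype pairs are all assumptions made for this example, and the gene-by-gene iteration and profiling of the rare group described above are omitted.

```python
from collections import defaultdict


def em_haplotype_frequencies(compatible_pairs, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies with a basic EM algorithm.

    compatible_pairs: one list per individual, holding every unordered
    haplotype pair (h1, h2) consistent with that individual's (possibly
    ambiguous) genotype call. Hypothetical input format for this sketch.
    """
    haplotypes = sorted({h for pairs in compatible_pairs for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    n_chromosomes = 2 * len(compatible_pairs)

    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in compatible_pairs:
            # E-step: posterior weight of each compatible pair, assuming
            # Hardy-Weinberg proportions (factor 2 for heterozygous pairs).
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        new_freq = {h: counts[h] / n_chromosomes for h in haplotypes}
        delta = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if delta < tol:
            break
    return freq


def collapse_rare(freq, threshold, rare_label="RARE"):
    """Pool all haplotypes with frequency below threshold into one group."""
    collapsed = defaultdict(float)
    for h, f in freq.items():
        collapsed[h if f >= threshold else rare_label] += f
    return dict(collapsed)


# Toy data: two biallelic loci, so four possible haplotypes.
# Double heterozygotes are ambiguous between AB/ab and Ab/aB.
pairs = (
    [[("AB", "AB")]] * 4
    + [[("ab", "ab")]] * 4
    + [[("AB", "ab"), ("Ab", "aB")]] * 2
)
freq = em_haplotype_frequencies(pairs)
collapsed = collapse_rare(freq, threshold=0.01)
```

In this toy data set the unambiguous individuals carry only AB and ab, so the EM iterations assign the ambiguous individuals almost entirely to the AB/ab pair, and the two remaining haplotypes end up in the pooled rare group.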

Details

Original language: English
Journal: Genetic epidemiology : the official publication of the International Genetic Epidemiology Society
Publication status: Electronic publication ahead of print - 13 Oct. 2023
Peer-review status: Yes

External IDs

Scopus 85174034874