BACKGROUND: Genotype imputation is a common technique in genetic research. Genetic similarity between the target population and the reference dataset is crucial for high-quality results. Although several reference panels are available, it is often not clear which one is optimal for a particular target dataset to be imputed. Maximizing genetic similarity between the study sample and the intended reference panels may be a straightforward method for selecting the genetically best-matched reference. However, the impact of genetic similarity on imputation accuracy has not yet been studied in detail.
RESULTS: We performed a simulation study in 20 ethnic groups obtained from POPRES. High-quality SNPs were masked and re-imputed with MaCH, MaCH-minimac and IMPUTE2 using four different HapMap reference panels (CEU, CHB-JPT, MEX and YRI). Imputation accuracy was assessed by different statistics. Genetic similarity between ethnic groups and reference populations was measured by F-statistics (F(ST)), originally proposed by Wright, and G-statistics (G(ST)), introduced by Nei and others. To assess the predictive power of these measures regarding imputation accuracy, we analysed the relations between them and the corresponding imputation accuracy scores. We found that population genetic distances between homogeneous reference and target populations were strongly linearly correlated with the resulting imputation accuracies, irrespective of the distance measure, imputation accuracy measure, missingness and imputation software used. A possible exception was the African population.
CONCLUSION: Usage of G(ST) or F(ST)-related measures for predicting the optimal reference panel for imputation frameworks relying on a specific reference is highly recommended. A cut-off of G(ST) < 0.01 is recommended to achieve good imputation results for high-frequency variants and small data sets.
The linear relationship is less pronounced for low-frequency variants, for which we also observed a dependence of imputation accuracy on the number of polymorphic sites in the reference. We also show that the software-specific measures MaCH-Rsq and IMPUTE-info must be interpreted with caution if the genetic distance between target and reference population is high.
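To make the panel-selection idea concrete, the following is a minimal sketch of how a G(ST)-based ranking of candidate reference panels could look. It is not the authors' implementation: it assumes biallelic SNPs, two equally weighted populations, and the plain (sample-size uncorrected) form G(ST) = 1 - H_S / H_T, with heterozygosities summed over loci. The function names `nei_gst` and `rank_panels`, the panel labels and the allele frequencies are hypothetical illustrations; only the 0.01 cut-off is taken from the abstract.

```python
from typing import Dict, List, Sequence, Tuple


def nei_gst(p_target: Sequence[float], p_ref: Sequence[float]) -> float:
    """Uncorrected multi-locus G_ST between two populations.

    Inputs are per-SNP frequencies of the same allele in each population.
    Assumption: both populations get equal weight and no sample-size
    correction is applied.
    """
    if len(p_target) != len(p_ref) or len(p_target) == 0:
        raise ValueError("frequency vectors must be non-empty and equally long")

    hs_sum = 0.0  # within-population expected heterozygosity, summed over loci
    ht_sum = 0.0  # total expected heterozygosity, summed over loci
    for pt, pr in zip(p_target, p_ref):
        hs_sum += 0.5 * (2.0 * pt * (1.0 - pt) + 2.0 * pr * (1.0 - pr))
        p_bar = 0.5 * (pt + pr)
        ht_sum += 2.0 * p_bar * (1.0 - p_bar)

    if ht_sum == 0.0:
        # Every locus is fixed for the same allele in both populations.
        return 0.0
    return 1.0 - hs_sum / ht_sum


def rank_panels(
    p_target: Sequence[float],
    panels: Dict[str, Sequence[float]],
    cutoff: float = 0.01,
) -> List[Tuple[str, float, bool]]:
    """Sort panels by increasing G_ST and flag those below the cut-off."""
    scored = sorted((nei_gst(p_target, freqs), name) for name, freqs in panels.items())
    return [(name, gst, gst < cutoff) for gst, name in scored]


if __name__ == "__main__":
    # Toy frequencies, invented purely for illustration.
    target = [0.50, 0.20, 0.35]
    candidate_panels = {
        "panel_A": [0.48, 0.22, 0.33],
        "panel_B": [0.90, 0.80, 0.10],
    }
    for name, gst, ok in rank_panels(target, candidate_panels):
        print(f"{name}: G_ST={gst:.4f} below_cutoff={ok}")
```

In practice a corrected estimator and genome-wide SNP sets would likely be preferable; the sketch only illustrates the ranking logic of choosing the panel with the smallest genetic distance to the target sample.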
PubMed ID: 26193934
Projects: Genetical Statistics and Systems Biology, LIFE - Leipzig Research Center for Civilization Diseases
Journal: BMC Genet
Citation: BMC Genet. 2015 Jul 22;16:90. doi: 10.1186/s12863-015-0248-2.
Date Published: 22nd Jul 2015
Registered Mode: by PubMed ID
Created: 9th May 2019 at 10:56
Last updated: 7th Dec 2021 at 17:58