Imputation of missing genotypes: an empirical evaluation of IMPUTE.

Zhao Z, Timofeev N, Hartley SW, Chui DH, Fucharoen S, Perls TT, Steinberg MH, Baldwin CT, Sebastiani P - BMC Genet. (2008)

Bottom Line: The accuracy decreases to 86% or 94% when subjects are African Americans or Asians, respectively. Our analysis suggests that IMPUTE is very accurate in samples of Caucasian origin, slightly less accurate in samples of Asian background, and substantially less accurate in samples of admixed background such as African Americans. Sample size and ascertainment do not seem to affect the accuracy of imputation.


Affiliation: Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue, Boston MA 02118, USA.

ABSTRACT

Background: Imputation of missing genotypes is becoming a very popular solution for synchronizing genotype data collected with different microarray platforms, but the effects of ethnic background, subject ascertainment, and amount of missing data on the accuracy of imputation are not well understood.

Results: We evaluated the accuracy of the program IMPUTE in generating the genotype data of partially or fully untyped single nucleotide polymorphisms (SNPs). The program uses a model-based approach that reconstructs the genotype distribution given a set of reference haplotypes and the observed data, and then uses this distribution to compute, for each subject, the marginal probability of each missing genotype, which is used to impute the missing data. We assembled genome-wide data from five different studies and three different ethnic groups comprising Caucasians, African Americans and Asians. We randomly removed genotype data and then compared the observed genotypes with those generated by IMPUTE. Our analysis shows 97% median accuracy in Caucasian subjects when less than 10% of the SNPs are untyped and imputed genotypes are accepted regardless of their posterior probability. The median accuracy increases to 99% when we require a minimum posterior probability of 0.95 for an imputed genotype to be acceptable. The accuracy decreases to 86% or 94% when subjects are African Americans or Asians, respectively. We propose a strategy to improve the accuracy by leveraging the level of admixture in African Americans.

Conclusion: Our analysis suggests that IMPUTE is very accurate in samples of Caucasian origin, slightly less accurate in samples of Asian background, and substantially less accurate in samples of admixed background such as African Americans. Sample size and ascertainment do not seem to affect the accuracy of imputation.


Figure 1: Distribution of imputation accuracies when 1% of the SNPs were randomly selected from chromosome 21 and their genotype data completely removed in the NNC set. The results for other proportions of missing SNPs are in the supplementary material. In each of the 1,000 simulations we randomly selected 1% of the SNPs, removed their genotype data, and imputed them. The chromosome is tagged by approximately 5,900 SNPs, so 59 SNPs were removed in each run and 59,000 SNPs had to be imputed across all 1,000 simulations. The x-axis reports the accuracy of each of the 59,000 SNPs that were imputed in the 1,000 simulations. The y-axis reports the frequency of the different imputation accuracies.
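The masking design described in the caption can be sketched as follows. This is only an illustration of the counting, not the authors' code: the function name `select_masked_snps`, the seed, and the use of integer SNP indices are assumptions, and 5,900 is the approximate SNP count given in the caption.

```python
import random

def select_masked_snps(n_snps, fraction, n_runs, seed=0):
    """For each run, draw a random subset of SNP indices to mask (hypothetical helper)."""
    rng = random.Random(seed)
    n_masked = round(n_snps * fraction)  # 1% of 5,900 SNPs -> 59 per run
    # Sampling is without replacement within a run; runs are drawn independently,
    # so the same SNP can be masked in more than one run.
    return [sorted(rng.sample(range(n_snps), n_masked)) for _ in range(n_runs)]

runs = select_masked_snps(n_snps=5900, fraction=0.01, n_runs=1000)
per_run = len(runs[0])
total = sum(len(r) for r in runs)
print(per_run, total)
```

Because runs are independent, the 59,000 imputed SNPs are SNP-by-run pairs rather than 59,000 distinct markers.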

Mentions: Table 1 shows the summary statistics of the accuracy of the method when we impute an increasing proportion of SNPs in chromosome 21 in the NNC set. The results confirm very good accuracy of the imputation method when either 100% or 80% of genotypes are missing in up to 40% of the SNPs. In fact, more than 40% of the SNPs have to be missing to lower the median accuracy to less than 95%. The median accuracy increases to 99% when we impose a posterior probability greater than 0.95 as the threshold to accept the imputed genotypes. This gain in accuracy comes at the cost of completeness, because only 71–82% of imputed genotypes were acceptable. Figure 1 shows the distribution of imputation accuracy when 100% of genotypes in 1% of randomly selected SNPs were removed from the genotype data of chromosome 21 in the NNC set. The data are essentially those summarized in column 2 of Table 1 and show a clearly skewed distribution: a very small number of SNPs failed to be imputed correctly, while the majority of SNPs were imputed with high accuracy. We examined 30 SNPs with very low accuracy and found that most of them are in recombination hotspots, which were estimated from Phase II HapMap data.
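The trade-off between accuracy and completeness under a posterior probability threshold can be illustrated with a minimal sketch. It assumes that each masked genotype comes with three posterior probabilities (for genotypes coded 0, 1, 2) and that the imputed genotype is the most probable one; the function names and the toy data are hypothetical and do not reproduce IMPUTE's output format or the study's results.

```python
def best_guess(posteriors):
    """Return (genotype, probability) for the most probable genotype."""
    genotype = max(range(len(posteriors)), key=lambda g: posteriors[g])
    return genotype, posteriors[genotype]

def accuracy_and_call_rate(true_genotypes, posterior_rows, threshold=0.0):
    """Accuracy among accepted calls, and the fraction of calls accepted."""
    accepted = correct = 0
    for truth, post in zip(true_genotypes, posterior_rows):
        genotype, prob = best_guess(post)
        if prob >= threshold:  # assumption: calls at or above the threshold are accepted
            accepted += 1
            correct += (genotype == truth)
    accuracy = correct / accepted if accepted else float("nan")
    call_rate = accepted / len(true_genotypes)
    return accuracy, call_rate

# Toy data (made up): masked true genotypes and their posterior distributions.
truth = [0, 1, 2, 1, 0]
post = [
    [0.98, 0.02, 0.00],
    [0.10, 0.85, 0.05],
    [0.01, 0.02, 0.97],
    [0.60, 0.35, 0.05],  # confident enough to call, but wrong
    [0.96, 0.03, 0.01],
]
no_filter = accuracy_and_call_rate(truth, post)
filtered = accuracy_and_call_rate(truth, post, threshold=0.95)
print(no_filter, filtered)
```

In this toy example the threshold removes the one wrong call and one correct but uncertain call, so accuracy rises while the share of accepted genotypes falls, which is the same direction of trade-off the paragraph reports.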

