Limits...
A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Howie BN, Donnelly P, Marchini J - PLoS Genet. (2009)

Bottom Line: We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made.We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method.One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

View Article: PubMed Central - PubMed

Affiliation: Department of Statistics, University of Oxford, Oxford, UK.

ABSTRACT
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

Show MeSH
Schematic drawing of imputation Scenario B.In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into three disjoint sets: a set T (blue) that is genotyped in all individuals, a set U2 (yellow) that is genotyped in both the haploid and diploid reference panels but not the study sample, and a set U1 (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U2 in the study sample and SNPs in the set U1 in both the study sample and, if desired, the diploid reference panel.
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC2689936&req=5

pgen-1000529-g002: Schematic drawing of imputation Scenario B.In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into three disjoint sets: a set T (blue) that is genotyped in all individuals, a set U2 (yellow) that is genotyped in both the haploid and diploid reference panels but not the study sample, and a set U1 (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U2 in the study sample and SNPs in the set U1 in both the study sample and, if desired, the diploid reference panel.

Mentions: One new data configuration, which we call Scenario B and explore in detail in the current study, is presented in Figure 2; we will address other next-generation reference panels in the Discussion. In Scenario B, there are different amounts of genotype data in different cohorts of a study. For example, the Wellcome Trust Case Control Consortium (WTCCC) is currently performing an association study in which 6,000 controls will be genotyped on both the Affymetrix 6.0 and Illumina 1 M SNP chips, whereas disease cohorts will be genotyped only on either the Affymetrix 6.0 chip or the Illumina 670 k chip. In other words, a large set of controls will be genotyped at a subset of HapMap SNPs, and each case cohort will be genotyped at a subset of the SNPs typed in the controls. Published studies have already employed this design [14], and it may become more prevalent in the future as common sets of population controls become more widely available.


A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Howie BN, Donnelly P, Marchini J - PLoS Genet. (2009)

Schematic drawing of imputation Scenario B.In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into three disjoint sets: a set T (blue) that is genotyped in all individuals, a set U2 (yellow) that is genotyped in both the haploid and diploid reference panels but not the study sample, and a set U1 (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U2 in the study sample and SNPs in the set U1 in both the study sample and, if desired, the diploid reference panel.
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC2689936&req=5

pgen-1000529-g002: Schematic drawing of imputation Scenario B.In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into three disjoint sets: a set T (blue) that is genotyped in all individuals, a set U2 (yellow) that is genotyped in both the haploid and diploid reference panels but not the study sample, and a set U1 (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U2 in the study sample and SNPs in the set U1 in both the study sample and, if desired, the diploid reference panel.
Mentions: One new data configuration, which we call Scenario B and explore in detail in the current study, is presented in Figure 2; we will address other next-generation reference panels in the Discussion. In Scenario B, there are different amounts of genotype data in different cohorts of a study. For example, the Wellcome Trust Case Control Consortium (WTCCC) is currently performing an association study in which 6,000 controls will be genotyped on both the Affymetrix 6.0 and Illumina 1 M SNP chips, whereas disease cohorts will be genotyped only on either the Affymetrix 6.0 chip or the Illumina 670 k chip. In other words, a large set of controls will be genotyped at a subset of HapMap SNPs, and each case cohort will be genotyped at a subset of the SNPs typed in the controls. Published studies have already employed this design [14], and it may become more prevalent in the future as common sets of population controls become more widely available.

Bottom Line: We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made.We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method.One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

View Article: PubMed Central - PubMed

Affiliation: Department of Statistics, University of Oxford, Oxford, UK.

ABSTRACT
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

Show MeSH