Limits...
Imputation and quality control steps for combining multiple genome-wide datasets.

Verma SS, de Andrade M, Tromp G, Kuivaniemi H, Pugh E, Namjou-Khales B, Mukherjee S, Jarvik GP, Kottyan LC, Burt A, Bradford Y, Armstrong GD, Derr K, Crawford DC, Haines JL, Li R, Crosslin D, Ritchie MD - Front Genet (2014)

Bottom Line: A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms.We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets.The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.

View Article: PubMed Central - PubMed

Affiliation: Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University Pennsylvania, PA, USA.

ABSTRACT
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R (2) (estimated correlation between the imputed and true genotypes), and the relationship between allelic R (2) and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.

No MeSH data available.


Related in: MedlinePlus

Frequency distribution of “info” quality metric (A,B) and relationship between the “info” score and MAF are shown (C,D). The secondary axis indicates the count of SNPs in each MAF bin (0.01 intervals).
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4263197&req=5

Figure 4: Frequency distribution of “info” quality metric (A,B) and relationship between the “info” score and MAF are shown (C,D). The secondary axis indicates the count of SNPs in each MAF bin (0.01 intervals).

Mentions: There are multiple results from imputation that should be evaluated before proceeding with association analyses for imputed SNPs. For instance, it is critical to consider the uncertainty of the imputed genotypes Figure 4A shows the distributions of the information (reported as “info score”) metrics for all variants in the adult imputed datasets and Figure 4B demonstrates the relationship between MAF and imputation quality for all variants in the pediatric imputed datasets by showing average “info” scores plotted in all variants grouped by MAF (bin sizes of 0.1). Although the total number of imputed variants for the adult and pediatric datasets is very similar, it was notable that there were comparatively more markers with low info score in the pediatric dataset. One potential reason for this discrepancy could be due to a much large number of genotyping platforms in the imputations of the pediatric datasets. While the average “info” scores with MAF < 0.05 fall lower than an info score of 0.8 as demonstrated in Figures 4C,D for adults and pediatric data respectively, within higher MAF bins, the average “info” scores increase to approximately 0.9. This metric demonstrates that variants imputed to have low MAF in the study samples are likely to have low MAF in the reference panel. We attempted to not include any monomorphic SNPs in the imputed dataset, our inclusion criteria was to include any imputed SNP that had at least one copy of minor allele. So the reason that we see a lot of SNPs with low “info” score is mostly due to the chosen imputation target and not any procedural error. Although there is no consensus in filtering the imputed datasets based on uncertainty of imputation, we used a variant level filter (info score >0.7) (Lin et al., 2010; Southam et al., 2011) for the downstream analyses. This is a conservative threshold, whereby we are balancing the quantity of lost data with data quality. Other studies may choose to be more liberal (info score >0.3) or even more conservative (info score >0.9).


Imputation and quality control steps for combining multiple genome-wide datasets.

Verma SS, de Andrade M, Tromp G, Kuivaniemi H, Pugh E, Namjou-Khales B, Mukherjee S, Jarvik GP, Kottyan LC, Burt A, Bradford Y, Armstrong GD, Derr K, Crawford DC, Haines JL, Li R, Crosslin D, Ritchie MD - Front Genet (2014)

Frequency distribution of “info” quality metric (A,B) and relationship between the “info” score and MAF are shown (C,D). The secondary axis indicates the count of SNPs in each MAF bin (0.01 intervals).
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4263197&req=5

Figure 4: Frequency distribution of “info” quality metric (A,B) and relationship between the “info” score and MAF are shown (C,D). The secondary axis indicates the count of SNPs in each MAF bin (0.01 intervals).
Mentions: There are multiple results from imputation that should be evaluated before proceeding with association analyses for imputed SNPs. For instance, it is critical to consider the uncertainty of the imputed genotypes Figure 4A shows the distributions of the information (reported as “info score”) metrics for all variants in the adult imputed datasets and Figure 4B demonstrates the relationship between MAF and imputation quality for all variants in the pediatric imputed datasets by showing average “info” scores plotted in all variants grouped by MAF (bin sizes of 0.1). Although the total number of imputed variants for the adult and pediatric datasets is very similar, it was notable that there were comparatively more markers with low info score in the pediatric dataset. One potential reason for this discrepancy could be due to a much large number of genotyping platforms in the imputations of the pediatric datasets. While the average “info” scores with MAF < 0.05 fall lower than an info score of 0.8 as demonstrated in Figures 4C,D for adults and pediatric data respectively, within higher MAF bins, the average “info” scores increase to approximately 0.9. This metric demonstrates that variants imputed to have low MAF in the study samples are likely to have low MAF in the reference panel. We attempted to not include any monomorphic SNPs in the imputed dataset, our inclusion criteria was to include any imputed SNP that had at least one copy of minor allele. So the reason that we see a lot of SNPs with low “info” score is mostly due to the chosen imputation target and not any procedural error. Although there is no consensus in filtering the imputed datasets based on uncertainty of imputation, we used a variant level filter (info score >0.7) (Lin et al., 2010; Southam et al., 2011) for the downstream analyses. This is a conservative threshold, whereby we are balancing the quantity of lost data with data quality. Other studies may choose to be more liberal (info score >0.3) or even more conservative (info score >0.9).

Bottom Line: A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms.We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets.The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.

View Article: PubMed Central - PubMed

Affiliation: Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University Pennsylvania, PA, USA.

ABSTRACT
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R (2) (estimated correlation between the imputed and true genotypes), and the relationship between allelic R (2) and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.

No MeSH data available.


Related in: MedlinePlus