Imputation and quality control steps for combining multiple genome-wide datasets.

Verma SS, de Andrade M, Tromp G, Kuivaniemi H, Pugh E, Namjou-Khales B, Mukherjee S, Jarvik GP, Kottyan LC, Burt A, Bradford Y, Armstrong GD, Derr K, Crawford DC, Haines JL, Li R, Crosslin D, Ritchie MD - Front Genet (2014)

Bottom Line: A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.


Affiliation: Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, University Park, PA, USA.

ABSTRACT
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.


Figure 2: Chromosome segmentation strategy for imputation with IMPUTE2. Each chromosome was divided into 6 Mb segments with a 250 kb overlap between adjacent segments.

Mentions: To perform imputation with IMPUTE2 on our phased data, we divided each chromosome into regions of approximately 6 Mb, beginning at the first imputation target, as displayed in Figure 2. This partitioned the 22 autosomes into 441 segments, ranging from 7 segments on chromosome 21 to 39 segments, the largest number, on chromosome 2; chromosome 1 had 36 segments. Breaking the genotypic data into smaller regions was beneficial for two reasons: IMPUTE2 has been reported to have improved accuracy over smaller genomic regions, and separate segments allow jobs to be parallelized over a multi-core compute cluster. Segments either overlapping the centromere or at the terminal ends of chromosomes were merged into the segment immediately upstream.
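
The segmentation described above can be illustrated with a minimal Python sketch. It is not the eMERGE pipeline code. It assumes that consecutive 6 Mb windows are stepped so that neighbors share 250 kb, and that a trailing fragment shorter than a hypothetical threshold (min_tail_bp, not given in the source) is merged into the segment immediately upstream. Centromere handling is omitted, and the function name and parameters are illustrative only.

```python
from typing import List, Tuple


def segment_chromosome(
    first_pos: int,
    last_pos: int,
    segment_bp: int = 6_000_000,   # approximate segment size from the text
    overlap_bp: int = 250_000,     # overlap from the Figure 2 caption
    min_tail_bp: int = 1_000_000,  # hypothetical merge threshold (assumption)
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) segments covering first_pos..last_pos."""
    if first_pos > last_pos:
        raise ValueError("first_pos must not exceed last_pos")
    if not 0 <= overlap_bp < segment_bp:
        raise ValueError("overlap_bp must be in [0, segment_bp)")

    # Assumption: each window starts overlap_bp before the previous one ends.
    step = segment_bp - overlap_bp
    segments: List[Tuple[int, int]] = []
    start = first_pos
    while True:
        end = min(start + segment_bp - 1, last_pos)
        segments.append((start, end))
        if end == last_pos:
            break
        start += step

    # Merge a short terminal fragment into the segment immediately upstream.
    if len(segments) > 1:
        tail_start, tail_end = segments[-1]
        if tail_end - tail_start + 1 < min_tail_bp:
            prev_start, _ = segments[-2]
            segments[-2:] = [(prev_start, tail_end)]
    return segments


if __name__ == "__main__":
    # Toy chromosome whose first imputation target is at position 1.
    for start, end in segment_chromosome(1, 20_000_000):
        print(start, end)
```

Each resulting interval could then be submitted as an independent imputation job, which is what makes the parallelization mentioned in the text possible.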

