Validation and assessment of variant calling pipelines for next-generation sequencing.

Pirooznia M, Kramer M, Parla J, Goes FS, Potash JB, McCombie WR, Zandi PP - Hum. Genomics (2014)

Bottom Line: Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable. Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses.


Affiliation: Department of Psychiatry and Behavioral Sciences, Johns Hopkins University, Baltimore, MD 21205, USA. pzandi@jhsph.edu.

ABSTRACT

Background: The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.

Results: We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable.
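For reference, the accuracy metric quoted here, positive predictive value (PPV), is the fraction of pipeline SNV calls confirmed by the gold standard, PPV = TP / (TP + FP). The short Python sketch below illustrates such a concordance check on toy data; the variant keys and counts are hypothetical and this is not the authors' evaluation code.

    # Hypothetical concordance check between pipeline SNV calls and a gold standard.
    # Variants are keyed as (chromosome, position, alt allele); a real evaluation would
    # also reconcile genotypes and allele encoding between sequencing and array platforms.
    def positive_predictive_value(pipeline_calls, gold_standard_calls):
        true_positives = len(pipeline_calls & gold_standard_calls)
        false_positives = len(pipeline_calls - gold_standard_calls)
        if true_positives + false_positives == 0:
            return float("nan")
        return true_positives / (true_positives + false_positives)

    pipeline = {("chr1", 1234567, "A"), ("chr2", 7654321, "T"), ("chr3", 1111111, "G")}
    gold = {("chr1", 1234567, "A"), ("chr2", 7654321, "T")}
    print(f"PPV = {positive_predictive_value(pipeline, gold):.2%}")  # 2 TP, 1 FP -> 66.67%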

Conclusions: Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.



Figure 1: Modular structure of pipeline for processing next-generation sequencing data. The pipeline contains 4 modules: (1) mapping, (2) filtering, (3) realignment/recalibration, and (4) variant calling. Detailed description is available at http://metamoodics.org/wes.

Mentions: We developed a modular pipeline for processing NGS data as shown in Figure 1 and described in Additional file 1 and in more detail at our Wiki site (http://metamoodics.org/wes). First, raw read data with well-calibrated base error estimates in fastq format are mapped to the reference genome. BWA (version 0.7.0) [26] is used to map reads to the human genome reference, allowing for two mismatches in 30-base seeds, and to generate alignments in the technology-independent SAM/BAM file format [21]. Next, duplicate fragments are marked and eliminated with Picard (version 1.8) (http://picard.sourceforge.net), mapping quality is assessed and low-quality mapped reads are filtered, and paired-read information is evaluated to ensure that mate-pair information is consistent between the reads of each pair. We then refine the initial alignments by local realignment and identify suspicious regions. Using this information as a covariate along with other technical covariates and known sites of variation, GATK base quality score recalibration (BQSR) is carried out. Lastly, SNV calling is performed using the recalibrated and realigned BAM files.
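To make the four modules concrete, the Python sketch below drives one paired-end sample through mapping, duplicate marking, realignment, recalibration, and HaplotypeCaller calling via subprocess. File names, jar locations, and the reference/dbSNP paths are hypothetical; the BWA seed settings follow the description above, and the GATK/Picard invocations use GATK 2/3-era and Picard 1.x-era command styles, which may differ from the authors' exact commands (see http://metamoodics.org/wes for their pipeline).

    # Minimal sketch of the four-module pipeline for a single paired-end sample.
    # Assumes ref.fa has been BWA-indexed and has .fai/.dict files, and that modern
    # samtools syntax is available; all file names below are placeholders.
    import subprocess

    def run(cmd):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    ref, dbsnp, sample = "ref.fa", "dbsnp.vcf", "sample"

    # Module 1: mapping with BWA, allowing two mismatches (-k 2) in 30-base seeds (-l 30)
    run(["bwa", "aln", "-l", "30", "-k", "2", "-f", f"{sample}_1.sai", ref, f"{sample}_1.fastq.gz"])
    run(["bwa", "aln", "-l", "30", "-k", "2", "-f", f"{sample}_2.sai", ref, f"{sample}_2.fastq.gz"])
    with open(f"{sample}.sam", "w") as sam:
        subprocess.run(["bwa", "sampe", ref, f"{sample}_1.sai", f"{sample}_2.sai",
                        f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz"], stdout=sam, check=True)

    # Module 2: filtering -- sort, mark/remove duplicates with Picard, index the result
    run(["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.sam"])
    run(["java", "-jar", "MarkDuplicates.jar", f"INPUT={sample}.sorted.bam",
         f"OUTPUT={sample}.dedup.bam", f"METRICS_FILE={sample}.dup_metrics.txt",
         "REMOVE_DUPLICATES=true"])
    run(["samtools", "index", f"{sample}.dedup.bam"])

    # Module 3: local realignment around suspicious regions, then base quality score recalibration
    run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "RealignerTargetCreator",
         "-R", ref, "-I", f"{sample}.dedup.bam", "-o", f"{sample}.intervals"])
    run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "IndelRealigner",
         "-R", ref, "-I", f"{sample}.dedup.bam",
         "-targetIntervals", f"{sample}.intervals", "-o", f"{sample}.realigned.bam"])
    run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
         "-R", ref, "-I", f"{sample}.realigned.bam", "-knownSites", dbsnp,
         "-o", f"{sample}.recal_table"])
    run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads",
         "-R", ref, "-I", f"{sample}.realigned.bam", "-BQSR", f"{sample}.recal_table",
         "-o", f"{sample}.recal.bam"])

    # Module 4: SNV calling on the recalibrated, realigned BAM
    run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "HaplotypeCaller",
         "-R", ref, "-I", f"{sample}.recal.bam", "-o", f"{sample}.vcf"])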

