VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering.

Gézsi A, Bolgár B, Marx P, Sarkozy P, Szalai C, Antal P - BMC Genomics (2015)

Bottom Line: This novel method had significantly higher sensitivity and precision than the individual variant callers in all target region sizes, ranging from a few hundred kilobases to whole exomes. Specifically, the computed probabilities of the variants can be used to order the variants, and for a given threshold, probabilities can be used to estimate precision. VariantMetaCaller can be applied to small target regions as well as whole exomes, and it can be used for organisms for which highly accurate variant call sets are not yet available; it can therefore be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used.


Affiliation: Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary. gezsi.andras@gmail.com.

ABSTRACT

Background: The low concordance between different variant calling methods still poses a challenge for the widespread application of next-generation sequencing in research and clinical practice. A wide range of variant annotations can be used for filtering call sets in order to improve the precision of the variant calls, but the choice of the appropriate filtering thresholds is not straightforward. Variant quality score recalibration provides an alternative solution to hard filtering, but it requires large-scale genomic data.

Results: We evaluated germline variant calling pipelines based on the BWA and Bowtie 2 aligners in combination with the GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes and SAMtools variant callers, using simulated and real benchmark sequencing data (NA12878 with Illumina Platinum Genomes). We argue that these pipelines are not merely discordant, but that they extract complementary useful information. We introduce VariantMetaCaller to test the hypothesis that the automated fusion of measurement-related information allows better performance than the recommended hard-filtering settings or recalibration, and than the fusion of the individual call sets without using annotations. VariantMetaCaller uses Support Vector Machines to combine multiple information sources generated by variant calling pipelines and estimates probabilities of variants. This novel method had significantly higher sensitivity and precision than the individual variant callers in all target region sizes, ranging from a few hundred kilobases to whole exomes. We also demonstrated that VariantMetaCaller supports a quantitative, precision-based filtering of variants under wider conditions. Specifically, the computed probabilities of the variants can be used to order the variants, and for a given threshold, probabilities can be used to estimate precision. Precision can then be directly translated to the number of true called variants, or equivalently, to the number of false calls, which allows finding a problem-specific balance between sensitivity and precision.

Conclusions: VariantMetaCaller can be applied to small target regions as well as whole exomes, and it can be used for organisms for which highly accurate variant call sets are not yet available; it can therefore be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used. VariantMetaCaller is freely available at http://bioinformatics.mit.bme.hu/VariantMetaCaller.
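
The step from per-variant probabilities to an estimated precision can be illustrated with a small sketch. This is not VariantMetaCaller's code: the function name expected_precision and the toy probabilities are hypothetical, and the sketch assumes the probabilities are well calibrated, so that their sum over the retained calls approximates the number of true calls.

```python
from typing import List, Tuple


def expected_precision(probabilities: List[float],
                       threshold: float) -> Tuple[int, float, float]:
    """Estimate precision for calls kept at a probability threshold.

    Assumes calibrated probabilities (an assumption of this sketch):
    the sum of the kept probabilities is the expected number of true
    calls, and dividing by the number of kept calls gives the
    expected precision.
    """
    kept = [p for p in probabilities if p >= threshold]
    if not kept:
        # No call passes the threshold, so precision is undefined.
        return 0, 0.0, float("nan")
    expected_true = sum(kept)
    return len(kept), expected_true, expected_true / len(kept)


# Toy probabilities for six hypothetical variant calls.
probs = [0.99, 0.95, 0.90, 0.60, 0.30, 0.05]
n_called, exp_true, precision = expected_precision(probs, threshold=0.5)
exp_false = n_called - exp_true

print(f"calls kept: {n_called}")
print(f"expected true calls: {exp_true:.2f}")
print(f"expected false calls: {exp_false:.2f}")
print(f"estimated precision: {precision:.2f}")
```

Raising the threshold keeps fewer calls with a higher estimated precision, and lowering it does the reverse; this is the sensitivity and precision trade-off the abstract refers to.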


Fig. 2: Fraction of all, true and false variants called by a different number of variant callers for simulated data. Sequencing reads covering the exonic region of a selected chromosome were simulated for 50 artificially generated samples with pre-known variations to the human genome (i.e. reference variants). Variants were called on the BWA–MEM and Bowtie 2 aligned reads by HaplotypeCaller, UnifiedGenotyper, FreeBayes and SAMtools. Stacked bars with different colors represent the fraction of all (a), true (b) and false (c) variants with respect to the reference variants, called by a given number of variant callers at various coverage depths (see the common legend at the bottom). Each panel is divided into four subpanels: the top pair shows SNPs, the bottom pair shows indels, the left column shows BWA alignment and the right column shows Bowtie 2 alignment.

Mentions: First, we quantified the concordance rates of the individual variant callers by counting the number of methods calling a given variant. The percentage of variants called concordantly by all four variant callers was considerably higher for SNPs than for indels (Fig. 2). For SNPs, the percentage of concordant variant calls increased from approximately 78−80 % at low coverage to 90−95 % at high coverage, depending on the aligner. Conversely, the percentage of singly-called variants decreased with increasing coverage, from approximately 7−10 % at low coverage to 1−2 % at high coverage (Fig. 2a). At low depths, singly-called variants were the second most frequent category, but with increasing coverage, this category became the least frequent.
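
The concordance count described above (how many callers report each variant) can be sketched as follows. The variant tuples, the call sets and the function name concordance_fractions are illustrative assumptions, not data or code from the study; variants are treated as identical only when chromosome, position, reference and alternate alleles all match.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


def concordance_fractions(call_sets: Dict[str, Set[Variant]]) -> Dict[int, float]:
    """Fraction of all distinct variants called by exactly k callers."""
    callers_per_variant: Counter = Counter()
    for calls in call_sets.values():
        # Each caller contributes at most one count per variant.
        callers_per_variant.update(calls)
    total = len(callers_per_variant)
    if total == 0:
        return {}
    by_count = Counter(callers_per_variant.values())
    return {k: by_count.get(k, 0) / total
            for k in range(1, len(call_sets) + 1)}


# Toy call sets (hypothetical variants, for illustration only).
a = ("chr1", 100, "A", "G")
b = ("chr1", 200, "C", "T")
c = ("chr1", 300, "G", "GA")
d = ("chr1", 400, "T", "C")
call_sets = {
    "HaplotypeCaller": {a, b, c},
    "UnifiedGenotyper": {a, b},
    "FreeBayes": {a, b, d},
    "SAMtools": {a},
}

fractions = concordance_fractions(call_sets)
for k, frac in fractions.items():
    print(f"called by {k} caller(s): {frac:.2f}")
```

In this toy input, variant a is fully concordant, b is called by three callers, and c and d are singly-called, which mirrors the categories stacked in Fig. 2.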

