Reproducibility of Variant Calls in Replicate Next Generation Sequencing Experiments.

Qi Y, Liu X, Liu CG, Wang B, Hess KR, Symmans WF, Shi W, Pusztai L - PLoS ONE (2015)

Bottom Line: The type of nucleotide substitution and genomic location of the variant had little impact on concordance, but concordance increased with coverage level, variant allele count (VAC), variant allele frequency (VAF), variant allele quality and p-value of SNV-call. The most important determinants of concordance were VAC and VAF. The sequence data have been deposited into the European Genome-phenome Archive (EGA) with accession number EGAS00001000826.


Affiliation: Departments of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America.

ABSTRACT
Nucleotide alterations detected by next-generation sequencing are not always true biological changes but could represent sequencing errors. Even highly accurate methods can yield substantial error rates when applied to millions of nucleotides. In this study, we examined the reproducibility of nucleotide variant calls in replicate sequencing experiments of the same genomic DNA. We performed targeted sequencing of all known human protein kinase genes (kinome) (~3.2 Mb) using the SOLiD v4 platform. Seventeen breast cancer samples were sequenced in duplicate (n=14) or triplicate (n=3) to assess concordance of all calls and single nucleotide variant (SNV) calls. The concordance rates over the entire sequenced region were >99.99%, while the concordance rates for SNVs were 54.3–75.5%. There was substantial variation in basic sequencing metrics from experiment to experiment. The type of nucleotide substitution and genomic location of the variant had little impact on concordance, but concordance increased with coverage level, variant allele count (VAC), variant allele frequency (VAF), variant allele quality and p-value of SNV-call. The most important determinants of concordance were VAC and VAF. Even using the highest stringency of QC metrics, the reproducibility of SNV calls was around 80%, suggesting that erroneous variant calling can be as high as 20–40% in a single experiment. The sequence data have been deposited into the European Genome-phenome Archive (EGA) with accession number EGAS00001000826.
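To make the notion of SNV concordance between replicates concrete, here is a minimal Python sketch. It is not the authors' pipeline: the `Call` record, the `concordance` function, the toy call sets, and the definition used (shared calls divided by the union of calls in both replicates) are all assumptions made for illustration, since the abstract does not spell out the exact formula. VAF is taken as VAC divided by coverage depth at the position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical record for one SNV call in one replicate."""
    chrom: str
    pos: int
    alt: str
    vac: int    # variant allele count: reads supporting the alternate allele
    depth: int  # total coverage at this position

    @property
    def vaf(self) -> float:
        # Variant allele frequency, assumed here to be VAC / depth.
        return self.vac / self.depth if self.depth else 0.0


def concordance(rep_a, rep_b, min_vac=0, min_vaf=0.0):
    """Shared calls / union of calls, after filtering on VAC and VAF.

    This Jaccard-style definition is an assumption, not the paper's formula.
    """
    def keys(calls):
        return {
            (c.chrom, c.pos, c.alt)
            for c in calls
            if c.vac >= min_vac and c.vaf >= min_vaf
        }

    a, b = keys(rep_a), keys(rep_b)
    union = a | b
    return len(a & b) / len(union) if union else float("nan")


# Toy replicate call sets (invented values, not data from the study).
rep_a = [
    Call("chr1", 100, "T", vac=40, depth=100),
    Call("chr2", 200, "G", vac=30, depth=60),
    Call("chr3", 300, "A", vac=2, depth=80),   # low-VAC call seen only in A
]
rep_b = [
    Call("chr1", 100, "T", vac=35, depth=90),
    Call("chr2", 200, "G", vac=25, depth=50),
    Call("chr4", 400, "C", vac=3, depth=70),   # low-VAC call seen only in B
]

unfiltered = concordance(rep_a, rep_b)
filtered = concordance(rep_a, rep_b, min_vac=5, min_vaf=0.05)
```

In this toy example the two low-VAC calls are the only discordant ones, so raising the VAC and VAF thresholds raises concordance, which mirrors the direction of the effect reported in the abstract. As a side note on scale, a concordance of 99.99% over roughly 3.2 Mb would still correspond to about 320 discordant positions.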



pone.0119230.g001: Barplots and boxplots showing variations in basic sequencing metrics between replicated samples and batches. Barplots of (A) the number of total reads (i.e. number of reads of F3- and F5-tagged paired reads), and (B) percentage of mapped reads in target region for each of the replicate pairs. The boxplots show the batch-to-batch differences in the number of total reads (C), the percentage of mapped reads in the target region (D), the average coverage (E), and the percentage of nucleic acids with ≥20x coverage within the target region (F).

Mentions: The basic sequencing metrics varied between samples and replicated experiments. The total number of read pairs in the 37 experiments ranged from 22.4 to 54.3 million, and the percentage of reads that mapped to targeted regions, defined as regions in the BED file provided by Agilent as their designed target, ranged from 43.22% to 70.43%. Fig 1A and 1B show bar graphs of the number of read pairs (i.e. number of F3- and F5-tagged paired reads) and the percentage of reads mapped to the target region for each of the replicated samples. Coverage depth (i.e. the number of reads at a given nucleotide position) ranged from 0 to 32,100 for individual nucleotide positions, and the average coverage depth ranged from 156 to 631 across the 37 experiments. Across the seven sequencing batches, the number of total read pairs, the percentage of reads mapped to a targeted region, the average depth of coverage, and the percentage of nucleotides with >20x coverage in the target region all showed relatively large variations (Fig 1C–1F). Table 1 shows the percentage of nucleotides in the targeted region with >1x and >20x coverage, respectively, for each replicate. The percentage of nucleotides in the targeted region with >20x coverage ranged from 77% to 93% in individual experiments. The Pearson correlation coefficients for coverage depth, computed over all nucleotide positions that had at least one read in both members of a replicate pair, ranged from 0.29 to 0.77.
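The two coverage summaries described above (the percentage of target positions at or above a depth threshold, and the Pearson correlation of depth between replicates restricted to positions covered in both) can be sketched as follows. This is an illustrative sketch only: the function names and the toy depth vectors are hypothetical, and the threshold is applied as "at least 20x" following the figure caption, whereas the text writes ">20x".

```python
import math


def pct_at_least(depths, threshold):
    """Percentage of positions whose depth is >= threshold."""
    return 100.0 * sum(d >= threshold for d in depths) / len(depths)


def pearson_shared(depth_a, depth_b):
    """Pearson r of per-position depth, using only positions
    with at least one read in both replicates."""
    pairs = [(a, b) for a, b in zip(depth_a, depth_b) if a > 0 and b > 0]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


# Toy per-position depths for one replicate pair (invented values).
rep1 = [0, 10, 20, 30, 40]
rep2 = [5, 20, 40, 60, 80]

pct20_rep1 = pct_at_least(rep1, 20)   # 3 of 5 positions reach 20x
r_shared = pearson_shared(rep1, rep2)  # first position is dropped (depth 0 in rep1)
```

Restricting the correlation to positions covered in both replicates, as the text describes, keeps uncovered positions from dominating the estimate; real targeted-sequencing data would show far more scatter than this deliberately linear toy example.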

