Reproducibility of Variant Calls in Replicate Next Generation Sequencing Experiments.

Qi Y, Liu X, Liu CG, Wang B, Hess KR, Symmans WF, Shi W, Pusztai L - PLoS ONE (2015)

Bottom Line: The type of nucleotide substitution and genomic location of the variant had little impact on concordance, but concordance increased with coverage level, variant allele count (VAC), variant allele frequency (VAF), variant allele quality and p-value of SNV-call. The most important determinants of concordance were VAC and VAF. The sequence data have been deposited into the European Genome-phenome Archive (EGA) with accession number EGAS00001000826.


Affiliation: Departments of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America.

ABSTRACT
Nucleotide alterations detected by next generation sequencing are not always true biological changes but could represent sequencing errors. Even highly accurate methods can yield substantial error rates when applied to millions of nucleotides. In this study, we examined the reproducibility of nucleotide variant calls in replicate sequencing experiments of the same genomic DNA. We performed targeted sequencing of all known human protein kinase genes (kinome) (~3.2 Mb) using the SOLiD v4 platform. Seventeen breast cancer samples were sequenced in duplicate (n=14) or triplicate (n=3) to assess concordance of all calls and single nucleotide variant (SNV) calls. The concordance rates over the entire sequenced region were >99.99%, while the concordance rates for SNVs were 54.3-75.5%. There was substantial variation in basic sequencing metrics from experiment to experiment. The type of nucleotide substitution and genomic location of the variant had little impact on concordance but concordance increased with coverage level, variant allele count (VAC), variant allele frequency (VAF), variant allele quality and p-value of SNV-call. The most important determinants of concordance were VAC and VAF. Even using the highest stringency of QC metrics the reproducibility of SNV calls was around 80% suggesting that erroneous variant calling can be as high as 20-40% in a single experiment. The sequence data have been deposited into the European Genome-phenome Archive (EGA) with accession number EGAS00001000826.
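
The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the SNV concordance shown here (calls shared by both replicates divided by calls made in either) is one plausible definition, assumed for the example rather than taken from the paper.

```python
def expected_discordant_sites(region_bp: int, concordance: float) -> float:
    """Number of positions expected to disagree between two replicates."""
    return region_bp * (1.0 - concordance)


def snv_concordance(calls_a: set, calls_b: set) -> float:
    """Share of SNVs called in either replicate that were called in both.

    Assumed definition (intersection over union), for illustration only.
    """
    union = calls_a | calls_b
    if not union:
        return float("nan")
    return len(calls_a & calls_b) / len(union)


# A ~3.2 Mb target at 99.99% overall concordance still leaves about 320
# discordant positions, which is large relative to a short list of SNV calls.
discordant = expected_discordant_sites(3_200_000, 0.9999)

# Toy replicate call sets, keyed by (chromosome, position, substitution).
replicate_1 = {("chr1", 100, "A>G"), ("chr2", 200, "C>T"), ("chr3", 300, "G>A")}
replicate_2 = {("chr2", 200, "C>T"), ("chr3", 300, "G>A"), ("chr4", 400, "T>C")}
toy_concordance = snv_concordance(replicate_1, replicate_2)
```

In the toy sets, two of the four distinct calls are shared, so the concordance is 0.5. The point of the sketch is that near-perfect agreement over millions of mostly reference positions is compatible with much lower agreement among the variant calls themselves.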


pone.0119230.g004: Comparisons between the relative importance of the 5 different variables in determining reproducibility of SNV calls. Importance was assessed using mutual information value (A), Akaike information criterion (B), and Lasso regression methods (C, D). On panels C and D, the y-axis indicates whether a factor is in the model (y = 1) or not (y = 0). VAC = variant allele count, VAF = variant allele frequency, VAQ = variant allele quality, and p-value refers to the SNP call p-value generated by BioScope.

Mentions: First, we examined the correlation between these 5 factors. Only coverage depth and variant allele count showed a strong linear correlation (S6 Fig). We then assessed the association of each factor with concordance status (concordant or discordant) using mutual information (treating the factors as discrete variables) and logistic regression (treating the factors as continuous variables). Mutual information measures the dependence between two variables, and larger values indicate stronger dependence [9]. VAC and VAF showed much larger mutual information with concordance status than the other three factors (Fig 4A). The Akaike information criterion (AIC) is often used in model selection, and models with smaller AIC values are preferred. VAC and VAF were again the two factors with the smallest AIC values, while all other factors showed comparable but larger values (Fig 4B).
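
The two measures described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the data are invented toy values, the function and variable names are hypothetical, VAF is binned at an arbitrary 0.25 cut-off for the mutual information step, and the logistic model uses a single predictor fitted by Newton-Raphson.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def logistic_aic(x, y, iters=50):
    """AIC of the model y ~ intercept + x, fitted by Newton-Raphson.

    AIC = 2k - 2 log L, with k = 2 parameters (intercept and slope).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                # gradient, intercept
            g1 += (yi - p) * xi         # gradient, slope
            h00 += w                    # entries of X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p)
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        loglik += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return 2 * 2 - 2 * loglik


# Invented toy data: 12 SNV calls, 1 = concordant across replicates.
concordant = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
vaf = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
# A deliberately uninformative factor: concordance is 50% in both groups.
noise = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

vaf_bins = ["high" if v >= 0.25 else "low" for v in vaf]  # assumed cut-off
mi_vaf = mutual_information(vaf_bins, concordant)
mi_noise = mutual_information(noise, concordant)
aic_vaf = logistic_aic(vaf, concordant)
aic_noise = logistic_aic(noise, concordant)
```

On these toy values the informative factor gets the larger mutual information and the smaller AIC, which is the pattern the text reports for VAC and VAF. Note that mutual information depends on how a continuous factor is discretized, so the binning choice here is part of the assumption.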

