Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing.
Bottom Line:
These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables. For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
Affiliation: Aix-Marseille Université, CNRS, IRD, UMR 6116 - IMEP, Equipe Evolution Génome Environnement, Centre Saint-Charles, Case 36, 3 place Victor Hugo, 13331 Marseille Cedex 3, France.
ABSTRACT
Background: The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error correction are based on an initial analysis of experimental sequences by Huse et al. in 2007, carried out for the older GS20 system. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments.
Results: We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.
Conclusions: The pattern identified here calls for the use of internal controls and error-correcting base callers to correct for errors, when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
Mentions: The nature and significance of a correlation between two variables do not provide any information about the ability of this combination of variables to explain a third variable [21]. For each plate and each kind of error, we considered a logistic model [22] (see materials and methods for the detailed procedure) accounting for the binary (error) variable in terms of the seven variables considered. To separate the effect of a given explanatory variable from the combined effect of the other variables, we propose (see materials and methods) breaking down each explanatory variable into three additive terms: the effect of the variable itself, the combined effect of the other variables and the rest. The combined effect of the variables ranged from 20% to 80% of the total variation in error rate (Figure 2 and additional file 3). More specifically, for individual error types, the combined effect accounted for 38.00 ± 13.05% of the total information for mismatch errors, 64.10 ± 4.54% for ambiguous base call errors, 75.83 ± 3.78% for insertion errors and 79.95 ± 3.08% for deletion errors. The remaining information results from the specific effects of each variable. These high percentages of shared information highlight the high degree to which the error can be explained by combinations of variables. This may be due either to partial redundancy of the information contained in each variable or to the combined contribution of the variables to the total amount of error explained [21]. In the first case, a variable may substitute for the effect of others, whereas, in the second, only the combined information provided by each variable can account for the observed pattern. The results of correlation analysis, indicating that most regression coefficients were low, ruled out redundancy as the primary cause of the observed pattern, as most variables were independent.
There is therefore no single variable consistently accounting for the distribution of sequencing error, as detailed in Figure 2. We investigated the main trends highlighted by the logistic model, by focusing on the distribution of sequencing error at the sequence level. We then characterized the variables most strongly influencing error in terms of the location of the bead carrying the sequence, in a given region of a PT plate.
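As a rough illustration of the kind of analysis described above, the following Python sketch fits a logistic model to simulated binary error data and splits the explained deviance into a part unique to each variable and a part shared among variables. It is not the authors' procedure, which is detailed in their materials and methods. The three variable names, the simulated effect sizes, the use of explained deviance as the measure of information, and the unique-versus-shared partition are all assumptions made for illustration.

```python
import math
import random


def fit_logistic(X, y, steps=400, lr=1.0):
    """Fit a logistic model by batch gradient descent; returns weights, bias first."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            r = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus observed
            grad[0] += r
            for j, xj in enumerate(xi):
                grad[j + 1] += r * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def deviance(w, X, y):
    """Binomial deviance (-2 log-likelihood) of the model on the data."""
    d = 0.0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        p = min(max(1.0 / (1.0 + math.exp(-z)), 1e-12), 1.0 - 1e-12)
        d -= 2.0 * (yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return d


def explained(X, y, cols):
    """Fraction of the null deviance explained by a model restricted to `cols`."""
    Xs = [[row[c] for c in cols] for row in X]
    ybar = sum(y) / len(y)
    null = -2.0 * sum(
        yi * math.log(ybar) + (1 - yi) * math.log(1.0 - ybar) for yi in y
    )
    return 1.0 - deviance(fit_logistic(Xs, y), Xs, y) / null


# Simulated data: hypothetical standardized variables, not the paper's seven.
random.seed(0)
names = ["homopolymer_len", "read_position", "plate_distance"]
X = [[random.gauss(0, 1) for _ in names] for _ in range(400)]
true_w = [-1.0, 1.5, 1.0, 0.5]  # assumed effect sizes, bias first
y = []
for xi in X:
    z = true_w[0] + sum(w * x for w, x in zip(true_w[1:], xi))
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-z)) else 0)

all_cols = list(range(len(names)))
full = explained(X, y, all_cols)

# Unique part of each variable: loss in explained deviance when it is dropped.
unique = {
    names[j]: full - explained(X, y, [c for c in all_cols if c != j])
    for j in all_cols
}
# Shared part: what the full model explains beyond the sum of unique parts.
shared = full - sum(unique.values())

print(f"full model explains {full:.3f} of the null deviance")
for name, u in unique.items():
    print(f"  unique to {name}: {u:.3f}")
print(f"  shared among variables: {shared:.3f}")
```

In this sketch, a large shared term relative to the unique terms would correspond to error that is explained mainly by combinations of variables. Because the simulated variables are independent, the shared term here reflects only the non-additivity of the logistic model and may be small or even negative.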