Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing.

Gilles A, Meglécz E, Pech N, Ferreira S, Malausa T, Martin JF - BMC Genomics (2011)

Bottom Line: These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables. For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.


Affiliation: Aix-Marseille Université, CNRS, IRD, UMR 6116 - IMEP, Equipe Evolution Génome Environnement, Centre Saint-Charles, Case 36, 3 place Victor Hugo, 13331 Marseille Cedex 3, France.

ABSTRACT

Background: The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse et al. in 2007, for the older GS20 system based on experimental sequences. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments.

Results: We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.

Conclusions: The pattern identified here calls for the use of internal controls and error-correcting base callers to correct for errors when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
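
The difficulty of separating low-frequency alleles from errors can be illustrated with a simple binomial calculation. The sketch below is not part of the study: it assumes, as a deliberate simplification, that every read covering a position carries the same independent probability of showing a given erroneous base, and it compares the reported mean error rate (1.07%) with a hypothetical hotspot position where the rate reaches 50%. The function name, the coverage of 100 reads and the threshold of 5 variant reads are illustrative assumptions.

```python
from math import comb


def prob_at_least_k_errors(n_reads, k, error_rate):
    """Return P(X >= k) for X ~ Binomial(n_reads, error_rate).

    Assumption: each read is an independent trial with the same
    per-position error probability (a simplification).
    """
    below_k = sum(
        comb(n_reads, i) * error_rate**i * (1.0 - error_rate) ** (n_reads - i)
        for i in range(k)
    )
    return 1.0 - below_k


# Hypothetical scenario: 100 reads cover a position, 5 of them carry a variant.
coverage, variant_reads = 100, 5

# Mean error rate reported in the abstract.
tail_mean = prob_at_least_k_errors(coverage, variant_reads, 0.0107)

# Illustrative hotspot where the error rate reaches 50%.
tail_hotspot = prob_at_least_k_errors(coverage, variant_reads, 0.50)

print(f"P(>= {variant_reads} erroneous reads) at mean rate:  {tail_mean:.4g}")
print(f"P(>= {variant_reads} erroneous reads) at a hotspot:  {tail_hotspot:.4g}")
```

Under these assumptions, a variant supported by a few reads is unlikely to be an artifact at the mean error rate but is expected by chance alone at a hotspot, which is why a single global error threshold may be misleading.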


Figure 2: Decomposition of error rate variation, using all available variables. For each plate, we used a logistic model to decipher the role of each selected variable and its contribution to error rate (see materials and methods). The error rate has been broken down as a function of error type: a) insertions, b) deletions, c) mismatches and d) ambiguous base calls. We tested the deviance from the complete model by breaking down the complete model into the sum of three terms: the first exclusive to the single effect of the variable considered (in black), the second exclusive effect of the rest of the variables without the variable of interest (in gray) and the last expressing the sum of the effects of interactions between the variable considered and the other variables (in white). The contribution of each term (the proportion) for a considered variable can be viewed on the y-axis. We display only the results for plate #1 (the results for the other plates are presented in additional file 3).

Mentions: The nature and significance of a correlation between two variables do not provide any information about the ability of this combination of variables to explain a third variable [21]. For each plate and each kind of error, we considered a logistic model [22] (see materials and methods for the detailed procedure) accounting for the binary (error) variable in terms of the seven variables considered. To separate the effect of a given explanatory variable from the combined effect of the other variables, we propose (see materials and methods) breaking down the contribution of each explanatory variable into three additive terms: the effect of the variable itself, the combined effect of the other variables, and the rest (the interactions between the variable and the others; see Figure 2). The combined effect of the variables ranged from 20% to 80% of the total variation in error rate (Figure 2 and additional file 3). More specifically, for individual error types, the combined effect accounted for 38.00% ± 13.05 of the total information for mismatch errors, 64.10% ± 4.54 for ambiguous base call errors, 75.83% ± 3.78 for insertion errors and 79.95% ± 3.08 for deletion errors. The remaining information results from the specific effects of each variable. These high percentages of shared information highlight the high degree to which the error can be explained by combinations of variables. This may be due to partial redundancy of the information contained in each variable or to the combined contribution of the variables to the total amount of error explained [21]. In the first case, a variable may substitute for the effect of others, whereas, in the second, only the combined information provided by each variable can account for the observed pattern. The results of the correlation analysis, which indicated that most regression coefficients were low, ruled out redundancy as the primary cause of the observed pattern, as most variables were independent.
There is therefore no single variable consistently accounting for the distribution of sequencing error, as detailed in Figure 2. We investigated the main trends highlighted by the logistic model by focusing on the distribution of sequencing error at sequence level. We then characterized the variables most strongly influencing error in terms of the location of the bead carrying the sequence in a given region of a PT plate.
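
The three-term breakdown described above can be sketched in code. The example below is only one plausible reading of the procedure, not the authors' implementation (the detailed method is in the materials and methods, which are not reproduced here). It assumes that the "explained" quantity is the drop in deviance relative to an intercept-only logistic model, that the own effect of a variable is the deviance explained by that variable alone, that the effect of the others is the deviance explained by the remaining variables, and that the rest is whatever is left of the full model. The data are simulated, and all function and variable names are hypothetical.

```python
import numpy as np


def logistic_deviance(X, y, n_iter=50):
    """Fit a logistic model (with intercept) by Newton-Raphson; return its deviance."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        w = p * (1.0 - p)
        # Tiny ridge term keeps the Hessian invertible (numerical safeguard only).
        hess = (A * w[:, None]).T @ A + 1e-8 * np.eye(A.shape[1])
        step = np.linalg.solve(hess, A.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(A @ beta))), 1e-12, 1.0 - 1e-12)
    return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def decompose(X, y, j):
    """Split the deviance explained by the full model into three shares for variable j."""
    n, n_vars = X.shape
    others = [k for k in range(n_vars) if k != j]
    null_dev = logistic_deviance(np.empty((n, 0)), y)      # intercept only
    explained_full = null_dev - logistic_deviance(X, y)    # all variables
    own = null_dev - logistic_deviance(X[:, [j]], y)       # variable j alone
    rest_of_vars = null_dev - logistic_deviance(X[:, others], y)
    remainder = explained_full - own - rest_of_vars        # assumed "rest" term
    shares = np.array([own, rest_of_vars, remainder]) / explained_full
    return explained_full, shares


# Simulated stand-in for one plate: 7 hypothetical variables, binary error outcome.
rng = np.random.default_rng(0)
n_obs = 4000
X_sim = rng.normal(size=(n_obs, 7))
eta = -3.0 + 0.8 * X_sim[:, 0] + 0.5 * X_sim[:, 1]
y_sim = (rng.random(n_obs) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

explained, shares = decompose(X_sim, y_sim, j=0)
print("own / others / rest shares:", np.round(shares, 3))
```

By construction the three shares sum to one. Under this reading, the rest term can be negative when variables carry redundant information, which is one reason to treat the sketch as illustrative rather than as a reproduction of Figure 2.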
