Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing.

Gilles A, Meglécz E, Pech N, Ferreira S, Malausa T, Martin JF - BMC Genomics (2011)

Bottom Line: These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables. For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.

Affiliation: Aix-Marseille Université, CNRS, IRD, UMR 6116 - IMEP, Equipe Evolution Génome Environnement, Centre Saint-Charles, Case 36, 3 place Victor Hugo, 13331 Marseille Cedex 3, France.

ABSTRACT

Background: The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse et al. in 2007, for the older GS20 system based on experimental sequences. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments.

Results: We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.

Conclusions: The pattern identified here calls for the use of internal controls and error-correcting base callers, where available (e.g. when sequencing amplicons), to correct for errors. For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.


Figure 3: Spatial distribution of error rate variation. For each error type and sequence length, the x-axis represents the spatial location of 454 reads and the y-axis represents the y-coordinates on the PT plate. The results presented in this figure correspond to plate #1. Data for the other two runs are presented in additional file 4. The 15 strips represent the 15 regions. We display separately the four types of error (insertions, deletions, mismatches and ambiguous base calls) and the length of the sequences generated. Colors indicate the ranges of error rates, from 0 to 1 (or the length of the sequences, from 0 to 500), using a sliding window (see Materials and Methods).

Given this pattern, the next step is to characterize the effect of bead localization on error rate, that is, whether position within a particular region or on the PT plate is linked to error rate. Error rate varied with bead location for insertions and deletions on every PT plate analyzed, at both the region and plate scales. More precisely, error rate variation was mostly accounted for by the combination of several variables: although the distribution of insertion errors followed a gradient along the y-axis in each region (Figure 3 and additional file 4), it was not accounted for by the variable Dist.region alone. However, the proportion of the model accounted for by the remaining variables is small (23.01% ± 2.62), and adding Dist.region to the model increases the explanatory power to 76.99% ± 2.62. The situation was similar for extraction of the signal at plate level, with Dist.plate increasing the explanatory power to 77.39% ± 2.12. In summary, all regions had heterogeneous insertion and deletion error rates, but there were conserved gradients along both the x and y axes. Inverse physical gradients were observed for insertions and deletions. The covariation of these error types and sequence length indicates that they are influenced by a single latent variable (Figure 3).
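
To make the idea of a positional gradient concrete, the following minimal Python sketch computes an error rate profile along the y-axis of one region using a sliding window. It is an illustration only, not the authors' procedure: the window parameters used in the article are not given in this excerpt, and the function name `windowed_error_rate`, the `(y, n_errors, n_bases)` read layout, the window and step sizes, and the toy data are all assumptions.

```python
def windowed_error_rate(reads, window, step):
    """Return (window_start, error_rate) pairs along the y-axis.

    reads  -- list of (y, n_errors, n_bases) tuples, one per read
              (hypothetical layout chosen for this sketch)
    window -- window width, in the same units as y
    step   -- distance between successive window starts
    """
    if not reads:
        return []
    y_min = min(y for y, _, _ in reads)
    y_max = max(y for y, _, _ in reads)
    profile = []
    start = y_min
    while start <= y_max:
        # Collect reads whose y-coordinate falls in [start, start + window).
        in_window = [(e, b) for y, e, b in reads if start <= y < start + window]
        bases = sum(b for _, b in in_window)
        if bases > 0:  # skip windows that contain no reads
            errors = sum(e for e, _ in in_window)
            profile.append((start, errors / bases))
        start += step
    return profile


# Toy data (invented): insertion errors increase with y.
reads = [
    (0, 1, 400), (50, 1, 400),
    (100, 2, 400), (150, 2, 400),
    (200, 4, 400), (250, 4, 400),
]
profile = windowed_error_rate(reads, window=100, step=100)
for start, rate in profile:
    print(f"y in [{start}, {start + 100}): error rate = {rate:.4f}")
```

Pooling errors and bases within each window, rather than averaging per-read rates, keeps short reads from dominating the estimate. A monotonic rise or fall across successive windows is the kind of pattern the text describes as a gradient along the y-axis.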

