Systematic exploration of error sources in pyrosequencing flowgram data.

Balzer S, Malde K, Jonassen I - Bioinformatics (2011)

Bottom Line: 454 pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance and cost, but shows higher per-base error rates. In addition to the well-known homopolymer length inaccuracies, we have identified errors likely to originate from other stages of the sequencing process. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim.

Affiliation: Institute of Marine Research, P.O. Box 1870, N-5817 Bergen, Norway. susanne.balzer@imr.no

ABSTRACT

Motivation: 454 pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance and cost, but shows higher per-base error rates. Although there are several tools available for noise removal, targeting different application fields, data interpretation would benefit from a better understanding of the different error types.

Results: By exploring 454 raw data, we quantify to what extent different factors account for sequencing errors. In addition to the well-known homopolymer length inaccuracies, we have identified errors likely to originate from other stages of the sequencing process. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim.

Availability: The flowsim pipeline is freely available under the General Public License from http://biohaskell.org/Applications/FlowSim.

Contact: susanne.balzer@imr.no.

Figure 1: Empirical flow values distributions (D.labrax) and derived intervals.

Mentions: In previous work, we derived empirical distributions from Dicentrarchus labrax (sea bass) Titanium data: by mapping 454 data to the originating reference genome (Kuhl et al., 2010), we characterized the distribution of flow values belonging to each homopolymer length (Balzer et al., 2010). These flow value distributions, one per homopolymer length, overlap, causing over- and under-calls (Fig. 1). Examined in detail, they reveal an interesting and hitherto unexplained pattern: each distribution often contains one major peak around the integral value representing the correct homopolymer length, but also smaller peaks around the neighboring integral values (Figs 1 and 3). Although these neighboring peaks have been observed previously, we have not seen any convincing explanation for them. Hypothesizing that they are caused by errors in the emulsion PCR performed prior to sequencing, we attempt to estimate to what extent PCR errors contribute to the overall error rate.
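
The way overlapping flow value distributions turn into over- and under-calls can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the flowsim implementation: it assumes flow values are normally distributed around the true homopolymer length (the text above describes empirical distributions instead), it assumes calls are made by rounding to the nearest integer, and the standard deviations are hypothetical values chosen only for illustration. The function names are likewise made up for this example.

```python
import math
import random


def call_homopolymer(flow_value: float) -> int:
    """Call a homopolymer length by rounding the flow value to the
    nearest integer (halves round up), never going below zero.
    ASSUMPTION: simple rounding stands in for the real base caller."""
    return max(0, math.floor(flow_value + 0.5))


def classify_call(true_length: int, flow_value: float) -> str:
    """Label a single flow as an under-call, over-call or correct call."""
    called = call_homopolymer(flow_value)
    if called > true_length:
        return "over"
    if called < true_length:
        return "under"
    return "correct"


def simulate_calls(true_length: int, n_flows: int, sd: float, seed: int = 0) -> dict:
    """Draw n_flows flow values from a normal distribution centred on
    true_length and count how each one would be called.
    ASSUMPTION: the normal model and sd are illustrative only."""
    rng = random.Random(seed)
    counts = {"under": 0, "correct": 0, "over": 0}
    for _ in range(n_flows):
        flow_value = rng.gauss(true_length, sd)
        counts[classify_call(true_length, flow_value)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical spread that widens with homopolymer length, so the
    # distributions of neighbouring lengths overlap more and more.
    for length in (1, 3, 6):
        sd = 0.05 + 0.05 * length
        print(length, simulate_calls(length, 10_000, sd))
```

Under this unimodal model, miscalls arise only from the tails of a distribution crossing the rounding threshold at half-integers. It does not reproduce the smaller secondary peaks around neighboring integral values described above, which is why those peaks call for a separate explanation.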

