Empirical assessment of sequencing errors for high throughput pyrosequencing data.

da Fonseca PG, Paiva JA, Almeida LG, Vasconcelos AT, Freitas AT - BMC Res Notes (2013)

Bottom Line: We also compared two models previously employed with success for peptide sequence alignment. As with protein alignments, a power-law model seems to approximate the indel errors more accurately, although the results are not conclusive enough to justify a departure from the commonly used affine gap penalty scheme. In either case, our procedure can be used to estimate more realistic error model parameters.


Affiliation: Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, Lisboa 1000-029, Portugal. pgsf@kdbio.inesc-id.pt.

ABSTRACT

Background: Sequencing-by-synthesis technologies significantly improve over the Sanger method in terms of speed and cost per base. However, they still usually fail to compete in terms of read length and quality. Current high-throughput implementations of the pyrosequencing technique yield reads whose length approaches that of reads from the capillary electrophoresis method. A less obvious question is whether their quality is affected by platform-specific sequencing errors.

Results: We present an empirical study aimed at assessing the quality of high-throughput pyrosequencing data and characterising its sequencing errors. We have developed a procedure for extracting sequencing error data from genome assemblies and studying their characteristics, in particular the length distribution of indel gaps and their relation to the sequence contexts where they occur. We used this procedure to analyse data from three prokaryotic genomes sequenced with the GS FLX technology. We also compared two models previously employed with success for peptide sequence alignment.

Conclusions: We observed a very low overall error rate in the analysed data, with indel errors being much more abundant than substitutions. We also observed a dependence between the length of the gaps and that of the homopolymer context where they occur. As with protein alignments, a power-law model seems to approximate the indel errors more accurately, although the results are not conclusive enough to justify a departure from the commonly used affine gap penalty scheme. In either case, our procedure can be used to estimate more realistic error model parameters.
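The comparison between the two gap models can be illustrated with a minimal sketch. It is not the authors' fitting procedure, which is not described here. It assumes that an affine gap penalty corresponds to a geometric gap-length distribution (log count linear in the gap length k) and that the power-law model corresponds to log count linear in ln k. The function names `fit_line` and `compare_gap_models` and the input dictionary format are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def compare_gap_models(length_counts):
    """Compare the two models on a dict {gap length: count}, all counts > 0.

    Assumption: both models are fitted by least squares on log counts,
    which is a simplification chosen only for illustration.
    """
    lengths = sorted(length_counts)
    log_counts = [math.log(length_counts[k]) for k in lengths]
    # Geometric (affine-penalty) model: log count is linear in k.
    _, _, sse_affine = fit_line(lengths, log_counts)
    # Power-law model: log count is linear in ln k.
    log_lengths = [math.log(k) for k in lengths]
    _, slope, sse_power = fit_line(log_lengths, log_counts)
    return {
        "affine_sse": sse_affine,
        "power_law_sse": sse_power,
        "power_law_exponent": -slope,
    }
```

The model with the smaller residual sum of squares describes the observed gap-length counts better under these assumptions. A likelihood-based comparison would be a more rigorous alternative.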



Figure 2: Indel gaps per context. (a) Absolute number of gaps per context (left: insertions, right: deletions). Contexts are represented in the form αℓ, where α indicates the base of the homopolymer and ℓ indicates its length (e.g. T5=TTTTT). (b) Ratio between the number of occurrences of a homopolymer as the context of a gap and the number of its exact match alignments (on the x-axis, the contexts are listed in the order A1, C1, G1, T1, A2, C2, G2, T2,...).

Mentions: Next, we examine the number of indel errors per context, which is illustrated in Figure 2(a). We notice that the absolute number of insertion gaps decreases with the size of the context, which is not surprising, given that large contexts are themselves relatively rare. Indeed, if we compute the ratio between the number of occurrences of a given homopolymer as an indel context and the number of its exact match alignments, we observe a clear tendency for the chance of having a gap in a homopolymer to increase with its length, as can be seen in Figure 2(b). The data shown in Figure 2 concern the Newbler assembler. The equivalent plots for the WGS assembler reveal qualitatively similar patterns and can be found in Additional file 2.
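The per-context ratio described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names (`homopolymer_contexts`, `gap_ratio`, `context_sort_key`) are hypothetical, and it assumes that the contexts of gaps and of exact matches have already been extracted from the assembly alignments as labels of the form αℓ (e.g. `T5`).

```python
from collections import Counter
from itertools import groupby


def homopolymer_contexts(seq):
    """Yield a label such as 'T5' for each maximal homopolymer run in seq."""
    for base, run in groupby(seq):
        yield f"{base}{len(list(run))}"


def gap_ratio(gap_contexts, match_contexts):
    """Ratio of gap occurrences to exact-match occurrences, per context."""
    gaps = Counter(gap_contexts)
    matches = Counter(match_contexts)
    # Contexts with no exact matches are skipped to avoid division by zero.
    return {ctx: gaps[ctx] / matches[ctx] for ctx in gaps if matches[ctx] > 0}


def context_sort_key(label):
    """Order labels as on the x-axis: A1, C1, G1, T1, A2, C2, ..."""
    return (int(label[1:]), label[0])


# Toy inputs, invented purely for illustration (not data from the paper).
print(list(homopolymer_contexts("AATTTTTG")))  # ['A2', 'T5', 'G1']
print(gap_ratio(["T5", "T5", "A2"], ["T5"] * 4 + ["A2"] * 10))
# {'T5': 0.5, 'A2': 0.1}
print(sorted(["T1", "A2", "C1", "A1"], key=context_sort_key))
# ['A1', 'C1', 'T1', 'A2']
```

Normalising by the number of exact matches is what separates the two effects in the text: long homopolymers produce few gaps in absolute terms because they are rare, yet each one is more likely to contain a gap.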

