Limits...
Pollux: platform independent error correction of single and mixed genomes.

Marinier E, Brown DG, McConkey BJ - BMC Bioinformatics (2015)

Bottom Line: Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads.Introduced errors are 20 to 70 times more rare than successfully corrected errors.Pollux is highly effective at correcting errors across platforms, and is consistently able to perform as well or better than currently available error correction software.

View Article: PubMed Central - PubMed

Affiliation: David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada. eric.marinier@uwaterloo.ca.

ABSTRACT

Background: Second-generation sequencers generate millions of relatively short, but error-prone, reads. These errors make sequence assembly and other downstream projects more challenging. Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads.

Results: We have developed a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times more rare than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software.

Conclusions: Pollux is highly effective at correcting errors across platforms, and is consistently able to perform as well or better than currently available error correction software. Pollux provides general-purpose error correction and may be used in applications with or without assembly.

Show MeSH
Thek-mer counts associated with an Illumina MiSeq read containing two highlighted substitution errors. The k-mers are of length 31. A data point corresponds to the number of times a left-anchored k-mer, starting at a given position within the sequence, has been observed in the entire read set. The first error is located near the middle of the read and affects the counts associated with 31 k-mers. The second error located only four positions in from the 3’ end of the read and affects only 4 k-mer counts.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4307147&req=5

Fig1: Thek-mer counts associated with an Illumina MiSeq read containing two highlighted substitution errors. The k-mers are of length 31. A data point corresponds to the number of times a left-anchored k-mer, starting at a given position within the sequence, has been observed in the entire read set. The first error is located near the middle of the read and affects the counts associated with 31 k-mers. The second error located only four positions in from the 3’ end of the read and affects only 4 k-mer counts.

Mentions: Similar to other methods [9,11], we approach the problem of error correction using k-mers, consecutive k-letter sequences identified in reads. However, our approach does not identify individual k-mers as erroneous, but rather compares the counts of adjacent k-mers within reads and identifies discontinuities. These discontinuities within reads are used to find error locations and evaluate correctness. We decompose a read into its k-mers and calculate their associated k-mer counts (Figure 1), which are the number of times a given k-mer has appeared in the entire set of reads. We use default k-mer lengths of 31, which are slightly larger than typically used in assembly [12,13]. We choose to use longer k-mers because this lets us avoid common short repeats which might otherwise confound our correction procedure. We choose k = 31, because it is the longest odd k that can be represented in a 64-bit word. We only record k-mers observed in the data set, so we maintain an extremely small subset of all possible k-mers of length 31. A read that is not erroneous is assumed to have a k-mer count profile that is reflective of a random sampling process, given local coverage. In contrast, a read that contains an error is likely to have k-mer counts that deviate unexpectedly from this random process. A substitution error, located at least k bases away from the ends of the read, will result in kk-mer counts affected. If this error is unique within the read set, these counts will drop to 1. A similarly defined insertion error will affect k+nk-mer counts and a deletion error will affect k−n counts, where n is the length of the indel. These unexpected drops in k-mer counts to a low depth are often erroneous, but we do not immediately make this assumption.Figure 1


Pollux: platform independent error correction of single and mixed genomes.

Marinier E, Brown DG, McConkey BJ - BMC Bioinformatics (2015)

Thek-mer counts associated with an Illumina MiSeq read containing two highlighted substitution errors. The k-mers are of length 31. A data point corresponds to the number of times a left-anchored k-mer, starting at a given position within the sequence, has been observed in the entire read set. The first error is located near the middle of the read and affects the counts associated with 31 k-mers. The second error located only four positions in from the 3’ end of the read and affects only 4 k-mer counts.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4307147&req=5

Fig1: Thek-mer counts associated with an Illumina MiSeq read containing two highlighted substitution errors. The k-mers are of length 31. A data point corresponds to the number of times a left-anchored k-mer, starting at a given position within the sequence, has been observed in the entire read set. The first error is located near the middle of the read and affects the counts associated with 31 k-mers. The second error located only four positions in from the 3’ end of the read and affects only 4 k-mer counts.
Mentions: Similar to other methods [9,11], we approach the problem of error correction using k-mers, consecutive k-letter sequences identified in reads. However, our approach does not identify individual k-mers as erroneous, but rather compares the counts of adjacent k-mers within reads and identifies discontinuities. These discontinuities within reads are used to find error locations and evaluate correctness. We decompose a read into its k-mers and calculate their associated k-mer counts (Figure 1), which are the number of times a given k-mer has appeared in the entire set of reads. We use default k-mer lengths of 31, which are slightly larger than typically used in assembly [12,13]. We choose to use longer k-mers because this lets us avoid common short repeats which might otherwise confound our correction procedure. We choose k = 31, because it is the longest odd k that can be represented in a 64-bit word. We only record k-mers observed in the data set, so we maintain an extremely small subset of all possible k-mers of length 31. A read that is not erroneous is assumed to have a k-mer count profile that is reflective of a random sampling process, given local coverage. In contrast, a read that contains an error is likely to have k-mer counts that deviate unexpectedly from this random process. A substitution error, located at least k bases away from the ends of the read, will result in kk-mer counts affected. If this error is unique within the read set, these counts will drop to 1. A similarly defined insertion error will affect k+nk-mer counts and a deletion error will affect k−n counts, where n is the length of the indel. These unexpected drops in k-mer counts to a low depth are often erroneous, but we do not immediately make this assumption.Figure 1

Bottom Line: Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads.Introduced errors are 20 to 70 times more rare than successfully corrected errors.Pollux is highly effective at correcting errors across platforms, and is consistently able to perform as well or better than currently available error correction software.

View Article: PubMed Central - PubMed

Affiliation: David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada. eric.marinier@uwaterloo.ca.

ABSTRACT

Background: Second-generation sequencers generate millions of relatively short, but error-prone, reads. These errors make sequence assembly and other downstream projects more challenging. Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads.

Results: We have developed a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times more rare than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software.

Conclusions: Pollux is highly effective at correcting errors across platforms, and is consistently able to perform as well or better than currently available error correction software. Pollux provides general-purpose error correction and may be used in applications with or without assembly.

Show MeSH