Error correction of high-throughput sequencing datasets with non-uniform coverage.

Medvedev P, Scott E, Kakaradov B, Pevzner P - Bioinformatics (2011)

Bottom Line: As a result, error correction of sequencing reads remains an important problem. In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of a Hamming graph and a simple probabilistic model for sequencing errors.


Affiliation: Department of Computer Science and Engineering, University of California, San Diego, CA, USA. pmedvedev@cs.ucsd.edu

ABSTRACT

Motivation: The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new applications. As a result, error correction of sequencing reads remains an important problem. Though several tools do an excellent job of correcting datasets where the reads are sampled close to uniformly, the problem of correcting reads coming from drastically non-uniform datasets, such as those from single-cell sequencing, remains open.

Results: In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of a Hamming graph and a simple probabilistic model for sequencing errors. It is a simple and adaptable algorithm that improves on other tools on non-uniform single-cell data, while achieving comparable results on normal multi-cell data.

Availability: http://www.cs.toronto.edu/~pashadag.

Contact: pmedvedev@cs.ucsd.edu.


Figure 1: On the left is an example of a typical cluster with good coverage. There are five k-mers clustered together, with five loci having mis-alignments. We compute the consensus string (taking multiplicities into account), which we find is already in the cluster (boxed in). All the k-mers are then corrected to the consensus. On the right is an example of a common cluster in low-coverage regions. The generating k-mer was sequenced three times, but each time with a single error. There are three k-mers in the cluster, but the consensus (boxed in) has not been sequenced and therefore is not in the cluster. Nevertheless, we correct all the k-mers to the consensus, allowing Hammer to reconstruct new k-mers that are not present in the original data.

Mentions: Finally, we process the clusters one at a time. Let C be the current cluster. If the cluster does not have size one and the consensus string is unique, we correct every k-mer in the cluster to the consensus string (Fig. 1).
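The cluster-processing step described above can be illustrated with a minimal Python sketch. It is not Hammer's implementation: the function names `consensus` and `correct_cluster`, the representation of a cluster as a dictionary from k-mer to multiplicity, and the reading of "consensus" as a per-position weighted majority vote (with any tie making the consensus non-unique) are all assumptions made for illustration.

```python
from collections import Counter


def consensus(cluster):
    """Return the per-position weighted majority string of a cluster.

    `cluster` maps each k-mer (all of equal length) to its multiplicity.
    Returns None when some position has a tie, i.e. when the consensus
    is not unique. (Assumed interpretation, not taken from the paper.)
    """
    k = len(next(iter(cluster)))
    letters = []
    for i in range(k):
        votes = Counter()
        for kmer, multiplicity in cluster.items():
            votes[kmer[i]] += multiplicity
        ranked = votes.most_common(2)
        # A tie between the two top letters means no unique consensus.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            return None
        letters.append(ranked[0][0])
    return "".join(letters)


def correct_cluster(cluster):
    """Map every k-mer in the cluster to its corrected form.

    Singleton clusters and clusters without a unique consensus are
    left unchanged; otherwise every k-mer maps to the consensus.
    """
    if len(cluster) == 1:
        return {kmer: kmer for kmer in cluster}
    target = consensus(cluster)
    if target is None:
        return {kmer: kmer for kmer in cluster}
    return {kmer: target for kmer in cluster}


# Low-coverage case in the spirit of Fig. 1 (right): three reads of the
# same k-mer, each carrying a single error at a different position.
low_coverage = {"TCGTACGT": 1, "ACGAACGT": 1, "ACGTACGA": 1}
corrected = correct_cluster(low_coverage)
```

In this made-up example, every k-mer maps to `ACGTACGT`, a string that does not occur in the input cluster. This mirrors the right-hand panel of Fig. 1, where the consensus was never sequenced yet all k-mers are corrected to it.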

