Limits...
Exploiting sparseness in de novo genome assembly.

Ye C, Ma ZS, Cannon CH, Pop M, Yu DW - BMC Bioinformatics (2012)

Bottom Line: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph.We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

View Article: PubMed Central - HTML - PubMed

Affiliation: Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China. cxy@umd.edu

ABSTRACT

Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.

Methods: In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.

Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

Show MeSH
From overlap graph to a string graph. (a) an overlap graph, in which all the overlaps are recorded. (b) the string graph, transitive overlap (a, c) is removed.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3369186&req=5

Figure 1: From overlap graph to a string graph. (a) an overlap graph, in which all the overlaps are recorded. (b) the string graph, transitive overlap (a, c) is removed.

Mentions: The string graph approach dramatically reduces the memory requirement by a factor roughly proportional to the depth of coverage. A string graph is an overlap graph where transitive edges have been removed, specifically if read A overlaps reads B and C, and B also overlaps C, (Figure 1a) the overlap (A, C) is removed from the graph as it can be inferred from the overlaps between (A, B), and (B, C) (Figure 1b). As a result, each read only needs to store roughly one overlap (multiple overlaps may need to be recorded due to sequencing errors and repeats), reducing the theoretical memory requirement to roughly 6-18 GB of memory. On real data, a recent assembler relying on the string graph approach was reported to use 54 GB memory for human genome assembly [7].


Exploiting sparseness in de novo genome assembly.

Ye C, Ma ZS, Cannon CH, Pop M, Yu DW - BMC Bioinformatics (2012)

From overlap graph to a string graph. (a) an overlap graph, in which all the overlaps are recorded. (b) the string graph, transitive overlap (a, c) is removed.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3369186&req=5

Figure 1: From overlap graph to a string graph. (a) an overlap graph, in which all the overlaps are recorded. (b) the string graph, transitive overlap (a, c) is removed.
Mentions: The string graph approach dramatically reduces the memory requirement by a factor roughly proportional to the depth of coverage. A string graph is an overlap graph where transitive edges have been removed, specifically if read A overlaps reads B and C, and B also overlaps C, (Figure 1a) the overlap (A, C) is removed from the graph as it can be inferred from the overlaps between (A, B), and (B, C) (Figure 1b). As a result, each read only needs to store roughly one overlap (multiple overlaps may need to be recorded due to sequencing errors and repeats), reducing the theoretical memory requirement to roughly 6-18 GB of memory. On real data, a recent assembler relying on the string graph approach was reported to use 54 GB memory for human genome assembly [7].

Bottom Line: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph.We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

View Article: PubMed Central - HTML - PubMed

Affiliation: Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China. cxy@umd.edu

ABSTRACT

Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.

Methods: In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.

Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

Show MeSH