Limits...
Exploiting sparseness in de novo genome assembly.

Ye C, Ma ZS, Cannon CH, Pop M, Yu DW - BMC Bioinformatics (2012)

Bottom Line: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph.We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

View Article: PubMed Central - HTML - PubMed

Affiliation: Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China. cxy@umd.edu

ABSTRACT

Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.

Methods: In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.

Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

Show MeSH
A node with branches in the de Bruijn graph and the sparse k-mer graph. (a) A node with branches in a de Bruijn graph. (b) The binary implementation of (a). (c) A node with branches in a sparse k-mer graph. (d) The binary implementation of (c). The k-mers which are nodes in the graph are squared in the blocks. Neighbouring nucleotides indicating the edges of the graph are circled.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3369186&req=5

Figure 2: A node with branches in the de Bruijn graph and the sparse k-mer graph. (a) A node with branches in a de Bruijn graph. (b) The binary implementation of (a). (c) A node with branches in a sparse k-mer graph. (d) The binary implementation of (c). The k-mers which are nodes in the graph are squared in the blocks. Neighbouring nucleotides indicating the edges of the graph are circled.

Mentions: In the de Bruijn graph, edges can be implicitly represented by saving only the presence of the neighbouring nucleotides (at most 4 for each k-mer). A common first stage of de Bruijn graph-based de novo assemblers is to build the graph by storing all the k-mers and their neighbouring nucleotide(s). A k-mer is considered being different only in orientation with its reverse complement, and only one of the two (chosen by lexical-order) is saved. Let all k-mers be encoded in bits: 00, 01, 10, 11, respectively, for A, C, G, T, and let 4 bits be used to indicate the presence/absence of the 4 possible edges/nucleotides on every side (Figure 2a, b). Thus, each k-mer uses 2 × k + 4 × 2 bits of memory, and the minimum space requirement S1 for a genome of size g is approximately S1 = G × (2 × k + 4 × 2), assuming no additional information needs to be saved. Note that this number does not, in theory, depend on depth of coverage (a k-mer is only stored once irrespective of how many reads contain it). However sequencing errors add a huge number of false k-mers, thus extra space has to be used to reach successful assembly.


Exploiting sparseness in de novo genome assembly.

Ye C, Ma ZS, Cannon CH, Pop M, Yu DW - BMC Bioinformatics (2012)

A node with branches in the de Bruijn graph and the sparse k-mer graph. (a) A node with branches in a de Bruijn graph. (b) The binary implementation of (a). (c) A node with branches in a sparse k-mer graph. (d) The binary implementation of (c). The k-mers which are nodes in the graph are squared in the blocks. Neighbouring nucleotides indicating the edges of the graph are circled.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3369186&req=5

Figure 2: A node with branches in the de Bruijn graph and the sparse k-mer graph. (a) A node with branches in a de Bruijn graph. (b) The binary implementation of (a). (c) A node with branches in a sparse k-mer graph. (d) The binary implementation of (c). The k-mers which are nodes in the graph are squared in the blocks. Neighbouring nucleotides indicating the edges of the graph are circled.
Mentions: In the de Bruijn graph, edges can be implicitly represented by saving only the presence of the neighbouring nucleotides (at most 4 for each k-mer). A common first stage of de Bruijn graph-based de novo assemblers is to build the graph by storing all the k-mers and their neighbouring nucleotide(s). A k-mer is considered being different only in orientation with its reverse complement, and only one of the two (chosen by lexical-order) is saved. Let all k-mers be encoded in bits: 00, 01, 10, 11, respectively, for A, C, G, T, and let 4 bits be used to indicate the presence/absence of the 4 possible edges/nucleotides on every side (Figure 2a, b). Thus, each k-mer uses 2 × k + 4 × 2 bits of memory, and the minimum space requirement S1 for a genome of size g is approximately S1 = G × (2 × k + 4 × 2), assuming no additional information needs to be saved. Note that this number does not, in theory, depend on depth of coverage (a k-mer is only stored once irrespective of how many reads contain it). However sequencing errors add a huge number of false k-mers, thus extra space has to be used to reach successful assembly.

Bottom Line: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph.We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

View Article: PubMed Central - HTML - PubMed

Affiliation: Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303 China. cxy@umd.edu

ABSTRACT

Background: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.

Methods: In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.

Results: We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.

Show MeSH