An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data.

Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL - Nucleic Acids Res. (2015)

Bottom Line: Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, achieved significantly better contigs than current approaches.


Affiliation: Blood Systems Research Institute, San Francisco, CA 94118, USA; Department of Laboratory Medicine, University of California at San Francisco, San Francisco, CA 94107, USA; xdeng@bloodsystems.org


Figure 2: Comparison of different assembly strategies using the in silico-virus spiked BASV datasets. (A) MACR for different k-mer sizes using A, V and S on the in silico-virus spiked BASV datasets (SetA-SetJ). Note that V only supports k ≤ 31. The bottom and top of each box are the first and third quartiles, and the band inside the box is the median; (B) C1000 distribution for each assembler; (C) MACR for each dataset; (D) chimera index; (E) execution time; (F) percentage of execution time spent on OLC final assembly; and (G) contig formation for SetB: the blue line represents the 12 648 bp BASV genome, and the red lines are contigs that align to the BASV genome. The assemblers in panels B, D, E and F are ordered by average value. Individual assemblers A, S, V, M, T and W were executed using eight threads, and O and C were executed using a single thread.

Mentions: Most DBG assemblers require that a k-mer size be provided as a configurable parameter. As the choice of an optimal k-mer value is not clear with metagenome assembly, we tested S and A using the ‘in silico-virus spiked’ datasets at increasing k-mer values of 31, 41, 51 and 61 (V does not support k-mer values >31) (Figure 2A). K-mer values ranging from 31 to 61 have previously been shown to be useful for DBG assemblers, whereas k-mer values below 31 seem to generate shorter contigs (35). Using the ‘in silico-virus spiked’ dataset, A performed better than S or V. For the S and A algorithms, no significant differences were observed when varying the k-mer value from 31 to 61 (P > 0.05, Kruskal–Wallis test). Since k-mer values must be smaller than the read length, we chose k = 31, which provides the greatest flexibility in the analysis of very short reads and keeps the parameter constant for comparative benchmarking of the S, A and V algorithms. It should be noted that the optimal k-mer value depends on the data being analyzed. We use k = 31 for this study, but it may not be optimal for other datasets.
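
The constraint that the k-mer value must not exceed the read length can be illustrated with a small sketch. It is not taken from any of the assemblers compared here: the function names (`kmers`, `kmer_counts`, `usable_reads`) and the toy reads are hypothetical, and the k values are scaled down to suit the toy read lengths. It only shows that a read shorter than k yields no k-mers and therefore contributes nothing to a k-mer-based (DBG) assembly.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping k-mer of a read.

    A read shorter than k yields nothing, because range() is empty
    when len(read) - k + 1 is zero or negative.
    """
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_counts(reads, k):
    """Count k-mer occurrences across all reads (illustrative helper)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def usable_reads(reads, k):
    """Number of reads long enough to contribute at least one k-mer."""
    return sum(1 for read in reads if len(read) >= k)


if __name__ == "__main__":
    # Toy reads of length 10, 5 and 13 (hypothetical data).
    reads = ["ACGTACGTAC", "ACGTA", "GTACGTACGTACG"]
    for k in (4, 8, 12):
        print(f"k={k}: usable reads={usable_reads(reads, k)}, "
              f"distinct k-mers={len(kmer_counts(reads, k))}")
```

As k grows, fewer of the toy reads remain usable, which is the trade-off behind choosing a small k such as 31 when very short reads must be retained.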

