An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data.

Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL - Nucleic Acids Res. (2015)

Bottom Line: Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, achieved significantly better contigs than current approaches.


Affiliation: Blood Systems Research Institute, San Francisco, CA 94118, USA; Department of Laboratory Medicine, University of California at San Francisco, San Francisco, CA 94107, USA. Contact: xdeng@bloodsystems.org


© Copyright Policy - creative-commons

Figure 1: Motivation and design of the ensemble assembly strategy. (A) Detection rates using blastx at various sequence lengths and mutation rates. Sequences were randomly extracted from virus RefSeq at various lengths (200 bp, 500 bp, 1000 bp, 2000 bp). Each base was mutated with a given probability (P = 0, 0.1, 0.2, 0.3, 0.4, 0.5) to simulate various degrees of divergence. (B) Detection rates using blastn at various sequence lengths and mutation rates. (C) The ensemble assembler, which integrates de Bruijn graph (DBG) assemblers and overlap-layout-consensus (OLC) assemblers. The cleaned reads are first assembled by individual DBG assemblers, partitioned assemblers or Mira4. The outputs of this first step are combined, length-filtered and fed into the OLC assemblers for final assembly. Different choices of component assemblers yield different ensemble assembly strategies.
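
The intermediate step in panel (C), where first-round contigs are combined and length-filtered before the OLC stage, might be sketched as follows. This is only an illustration, not the authors' implementation: the function names, the input layout (a dict mapping assembler name to contig strings) and the `min_len` threshold are all assumptions, and no cutoff value is taken from the paper.

```python
def pool_contigs(assemblies, min_len):
    """Combine contigs from several first-round assemblers and drop short ones.

    assemblies: dict mapping an assembler label to a list of contig strings
                (hypothetical input layout, chosen for illustration).
    min_len:    hypothetical length cutoff; the caption gives no value.
    Returns a list of (contig_id, sequence) tuples.
    """
    pooled = []
    for name, contigs in assemblies.items():
        for i, seq in enumerate(contigs):
            if len(seq) >= min_len:
                # Prefix the ID with the assembler label so the origin stays traceable.
                pooled.append((f"{name}_{i}", seq))
    return pooled


def to_fasta(records):
    """Render (contig_id, sequence) tuples as FASTA text for a downstream assembler."""
    return "".join(f">{cid}\n{seq}\n" for cid, seq in records)
```

The pooled FASTA text would then be handed to whichever OLC assembler is chosen as the second-stage component.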

Mentions: As a proof of concept that longer viral contigs can be better detected, sequences of various lengths (200, 500, 1000 and 2000 bp) were extracted from Virus RefSeq (Release 61), and each base was mutated with a fixed probability (0, 0.1, 0.2, 0.3, 0.4 or 0.5). We then ran blastx and blastn, with an E-value cutoff of 0.01, on the simulated contigs against the Virus RefSeq protein and nucleotide databases, respectively. Figure 1A shows that longer contigs clearly have a better chance of being detected by blastx, especially when the contigs are highly divergent. Figure 1B shows the same pattern for blastn. However, amino acid-based search achieves a better detection rate for highly divergent organisms than nucleotide-based homology search.
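
The simulation described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: a mutation is modelled as a substitution to a different base (the text does not say whether indels or same-base replacements were allowed), and the names `random_fragment` and `mutate` are hypothetical.

```python
import random

BASES = "ACGT"


def random_fragment(genome, length, rng):
    """Extract a random substring of the given length from a reference sequence."""
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]


def mutate(seq, p, rng):
    """Mutate each base independently with probability p.

    Assumption: a mutation substitutes the base with one of the other
    three bases, chosen uniformly at random.
    """
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is reproducible
    # Toy stand-in for a RefSeq genome, used here only for illustration.
    genome = "".join(rng.choice(BASES) for _ in range(10_000))
    for length in (200, 500, 1000, 2000):
        for p in (0, 0.1, 0.2, 0.3, 0.4, 0.5):
            contig = mutate(random_fragment(genome, length, rng), p, rng)
            # Each simulated contig would next be searched with blastx or blastn.
            assert len(contig) == length
```

Under this substitution-only model, the expected nucleotide identity of a simulated contig to its source is 1 - p.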

