An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data.

Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL - Nucleic Acids Res. (2015)

Bottom Line: Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, produced significantly better contigs than current approaches.


Affiliation: Blood Systems Research Institute, San Francisco, CA 94118, USA; Department of Laboratory Medicine, University of California at San Francisco, San Francisco, CA 94107, USA; xdeng@bloodsystems.org

Figure 5: Normalized measures and CPMs averaged over all target genomes for each assembler: (A) C1000NR distribution; (B) AccuracyNR distribution; (C) CPM distribution; (D) MACRNR distribution; and (E) SpeedNR distribution. These measures range from 0 to 5, with higher values representing better performance. Numbers in parentheses are the numbers of target genomes evaluated for each method. Genomes for which no assembler could generate a MACR > 1 kb were excluded from the calculation. Note that certain assemblers, such as G, W, X and SAVaM, failed to finish on many of the datasets due to software issues.
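The averaging and exclusion rule described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the function name `average_by_assembler`, and all numbers are hypothetical, and the normalization that produces the 0 to 5 scores is assumed to have been done already.

```python
from collections import defaultdict

def average_by_assembler(records, min_macr_kb=1.0):
    """Average a normalized score per assembler over the retained genomes.

    Each record is a dict with keys "genome", "assembler", "macr_kb" (raw
    MACR in kb) and "score" (normalized, 0-5). A failed run is represented
    by None in both "macr_kb" and "score" (an assumption of this sketch).
    Returns {assembler: (mean_score, genomes_evaluated)}.
    """
    # Best raw MACR reached by any assembler on each genome.
    best = defaultdict(float)
    for r in records:
        if r["macr_kb"] is not None:
            best[r["genome"]] = max(best[r["genome"]], r["macr_kb"])

    # Keep only genomes where at least one assembler exceeded the threshold.
    kept = {g for g, m in best.items() if m > min_macr_kb}

    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        if r["genome"] in kept and r["score"] is not None:
            totals[r["assembler"]] += r["score"]
            counts[r["assembler"]] += 1
    return {a: (totals[a] / counts[a], counts[a]) for a in counts}

# Hypothetical data: g2 is dropped (no MACR above 1 kb); A failed on g3.
records = [
    {"genome": "g1", "assembler": "I", "macr_kb": 5.0, "score": 4.0},
    {"genome": "g1", "assembler": "A", "macr_kb": 1.5, "score": 2.0},
    {"genome": "g2", "assembler": "I", "macr_kb": 0.4, "score": 3.0},
    {"genome": "g2", "assembler": "A", "macr_kb": 0.8, "score": 5.0},
    {"genome": "g3", "assembler": "I", "macr_kb": 3.0, "score": 5.0},
    {"genome": "g3", "assembler": "A", "macr_kb": None, "score": None},
]
averages = average_by_assembler(records)
```

The second element of each returned tuple corresponds to the number shown in parentheses in the figure, which is why an assembler that failed on some datasets is averaged over fewer genomes.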

Mentions: To summarize the results from the three datasets, we calculated the average normalized performance metrics (MACRNR, C1000NR, AccuracyNR and SpeedNR) and the composite performance metric (CPM) across target genomes in all datasets for each assembler (Figure 5). Among individual assemblers, I and T clearly produced the best C1000NR, whereas A, S, V, W and G produced poorly sized contigs (Figure 5A). Ensemble strategies that included partitioning, M or T produced the best C1000NR. MACRNR is highly correlated with C1000NR (Figure 5D). T, M, I and ensemble methods using them as a component, however, are among the worst assemblers in AccuracyNR (Figure 5B). M and methods with M as a component are the slowest (Figure 5E), indicating that they are not suitable for time-critical diagnostic applications. According to CPM, which measures overall contig quality, the highest-ranked assemblers were SAVaC, SAVTaC and SAVaO (Figure 5C). The best individual assembler is I, but its CPM is still significantly lower than that of the best ensemble assemblers. SAVTaC achieved very high C1000NR and MACRNR, but its AccuracyNR and SpeedNR were below average. SAVaC and SAVaO achieved better AccuracyNR and competitive C1000NR and MACRNR.

Figure 6 shows the relationships among the normalized measures. Figure 6A shows that assemblers fall in a curved belt, indicating a reciprocal relationship between AccuracyNR and C1000NR. Ensemble methods with M or T as a component in the first assembly had the largest C1000NR and poorest AccuracyNR, suggesting that these assemblers may be overly aggressive in assembling larger contigs. On the other side of the curve were DBG assemblers, which may be overly conservative in extending contigs. SAVaO and SAVaC are closest to the ideal upper right corner, indicating that they achieved balanced overall performance. Figure 6B shows that C1000NR is negatively correlated with SpeedNR. C1000NR and MACRNR are both contig size measures and are closely correlated (Figure 6C). Figure 6D shows a positive correlation between AccuracyNR and SpeedNR, indicating that faster assemblers are generally conservative and thus generate lower levels of chimeric contigs.
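The pairwise relationships described for Figure 6 can be checked with a correlation coefficient over the per-assembler averages. The sketch below uses Pearson's r as an assumption, since the excerpt does not say which statistic was used. The assembler names (`asm1` to `asm4`) and all scores are hypothetical and chosen only to mimic the reported signs.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-assembler averages on the 0-5 scale (not from the paper).
scores = {
    "asm1": {"C1000NR": 1.0, "AccuracyNR": 4.5, "SpeedNR": 4.0},
    "asm2": {"C1000NR": 2.0, "AccuracyNR": 4.0, "SpeedNR": 3.5},
    "asm3": {"C1000NR": 3.5, "AccuracyNR": 3.0, "SpeedNR": 2.0},
    "asm4": {"C1000NR": 4.5, "AccuracyNR": 1.5, "SpeedNR": 1.0},
}

def column(metric):
    """Collect one metric across assemblers in a fixed order."""
    return [scores[name][metric] for name in sorted(scores)]

r_size_speed = pearson(column("C1000NR"), column("SpeedNR"))
r_accuracy_speed = pearson(column("AccuracyNR"), column("SpeedNR"))
print(f"C1000NR vs SpeedNR:    r = {r_size_speed:.2f}")
print(f"AccuracyNR vs SpeedNR: r = {r_accuracy_speed:.2f}")
```

Because the AccuracyNR versus C1000NR relationship is described as a curved belt, a rank-based statistic such as Spearman's rho might describe that pair better than Pearson's r.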

