An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data.
Bottom Line: Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, achieved significantly better contigs than current approaches.
Affiliation: Blood Systems Research Institute, San Francisco, CA 94118, USA; Department of Laboratory Medicine, University of California at San Francisco, San Francisco, CA 94107, USA. firstname.lastname@example.org
Mentions: To summarize the results from the three datasets, we calculated the average normalized performance metrics (MACRNR, C1000NR, AccuracyNR and SpeedNR) and the composite performance metric (CPM) across target genomes in all datasets for each assembler (Figure 5). Among individual assemblers, I and T clearly produced the best C1000NR, whereas A, S, V, W and G produced poorly sized contigs (Figure 5A). Ensemble strategies that included partitioning, M or T produced the best C1000NR. MACRNR is highly correlated with C1000NR (Figure 5D). However, T, M, I and the ensemble methods that use them as a component are among the worst assemblers in AccuracyNR (Figure 5B). M and methods with M as a component are the slowest (Figure 5E), indicating that they are not suitable for time-critical diagnostic applications. According to CPM, which measures overall contig quality, the highest-ranked assemblers were SAVaC, SAVTaC and SAVaO (Figure 5C). The best individual assembler is I, but its CPM is still significantly lower than that of the best ensemble assemblers. SAVTaC achieved very high C1000NR and MACRNR, but its AccuracyNR and SpeedNR were below average. SAVaC and SAVaO achieved better AccuracyNR and competitive C1000NR and MACRNR.
Figure 6 shows the relationships among the normalized measures. Figure 6A shows that the assemblers fall in a curved belt, indicating a reciprocal relationship between AccuracyNR and C1000NR. Ensemble methods with M or T as a component in the first assembly had the largest C1000NR and the poorest AccuracyNR, suggesting that these assemblers may be overly aggressive in assembling larger contigs. On the other side of the curve were the DBG assemblers, which may be overly conservative in extending contigs. SAVaO and SAVaC are closest to the ideal upper right corner, indicating that they achieved balanced overall performance. Figure 6B shows that C1000NR is negatively correlated with SpeedNR. C1000NR and MACRNR are both contig size measures, and they are closely correlated (Figure 6C).
Figure 6D shows a positive correlation between AccuracyNR and SpeedNR, indicating that faster assemblers are generally conservative and thus generate a lower level of chimeric contigs.
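The averaging and pairwise comparison described above can be illustrated with a minimal Python sketch. It assumes each assembler already has one normalized score per target genome for each metric; the assembler names (`asm_x`, `asm_y`, `asm_z`), the toy values and the helper functions `average_metrics` and `pearson` are hypothetical placeholders, not results or code from the study. The normalization itself and the CPM formula are not given in this text, so they are not reproduced here.

```python
from math import sqrt
from statistics import mean

# Hypothetical per-genome normalized scores (placeholders, NOT data from the study):
# assembler -> metric -> one value per target genome.
scores = {
    "asm_x": {"C1000NR": [0.9, 0.7], "AccuracyNR": [0.4, 0.6]},
    "asm_y": {"C1000NR": [0.5, 0.5], "AccuracyNR": [0.8, 0.8]},
    "asm_z": {"C1000NR": [0.2, 0.4], "AccuracyNR": [0.9, 1.0]},
}


def average_metrics(per_genome):
    """Average each normalized metric across target genomes, per assembler."""
    return {
        assembler: {metric: mean(values) for metric, values in metrics.items()}
        for assembler, metrics in per_genome.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


averages = average_metrics(scores)
names = sorted(averages)

# Correlate two averaged metrics across assemblers, as in a Figure 6 style panel.
r = pearson(
    [averages[n]["C1000NR"] for n in names],
    [averages[n]["AccuracyNR"] for n in names],
)
print(f"Pearson r between C1000NR and AccuracyNR: {r:.3f}")
```

With these toy values the coefficient is negative, which mirrors the direction of the size versus accuracy trade-off described for Figure 6A but says nothing about its actual magnitude.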