MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning.
Bottom Line: We tackled the problem of classifying chimeric nodes using supervised machine learning to significantly improve the performance of MetaVelvet, and developed a new tool called MetaVelvet-SL. A Support Vector Machine is used to learn the classification model from 94 features extracted from candidate nodes. In extensive experiments, MetaVelvet-SL outperformed the original MetaVelvet and other state-of-the-art metagenomic assemblers (IDBA-UD, Ray Meta and Omega), reconstructing longer, accurate assemblies with higher N50 scores on both simulated data sets and real data sets of human gut microbial sequences.
Affiliation: Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
Mentions: When the total scaffold lengths of two assemblies differ substantially, as in the human gut microbial data sets, naive use of the N50 score is inadequate, because a longer total length decreases the N50 score. The generalized score N-len(x) is more appropriate for comparing scaffold integrity than the raw N50 score [2]. N-len(x) is defined by

\[
\text{N-len}(x) = |S_i| \quad \text{such that} \quad \sum_{j=1}^{i} |S_j| \ge x \quad \text{and} \quad \sum_{j=1}^{i-1} |S_j| < x, \tag{1}
\]

where $S_1, S_2, \dots, S_n$ denote the scaffolds output by an assembler, listed in descending order of length, and $|S_i|$ is the length of $S_i$. The N50 score corresponds to the N-len(x) score for x = L/2 (x is 50% of L), where L denotes the total scaffold length. The N-len(x) plots for the MH0006 data set produced by MetaVelvet-SL, MetaVelvet, IDBA-UD, SOAPdenovo2, Ray Meta and Omega are shown in Fig. 4. MetaVelvet-SL significantly increased the scaffold integrity. For example, at x = 5,000,000 the N-len(x) scores were 306,496 for MetaVelvet-SL, 24,554 for MetaVelvet, 178,659 for IDBA-UD, 90,861 for SOAPdenovo2, 101,726 for Ray Meta and 117,010 for Omega. (The N-len(x) plots for the MH0012, MH0047, SRS017227 and SRS018661 data sets are shown in Supplementary Figs S1–S4.) As in the MetaVelvet paper, we calculated the area under the curve (AUC) of N-len(x) for 0 < x ≤ L in units of 1,000,000 bp; that is, the cumulative sum of N-len(x) scores over 0 < x ≤ L, where L denotes the total scaffold length.
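The N-len(x) definition in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `n_len`, `n50` and `n_len_auc`, the `step` parameter, and the toy scaffold lengths are all assumptions made for the example. In particular, the AUC is approximated here by summing N-len(x) at multiples of `step`, which is one plausible reading of "in units of 1,000,000 bp".

```python
def n_len(scaffold_lengths, x):
    """Return N-len(x): the length |S_i| of the first scaffold (in descending
    order of length) at which the cumulative length reaches at least x."""
    if x <= 0:
        raise ValueError("x must be positive")
    cumulative = 0
    for length in sorted(scaffold_lengths, reverse=True):
        cumulative += length
        if cumulative >= x:
            return length
    raise ValueError("x exceeds the total scaffold length L")


def n50(scaffold_lengths):
    """N50 is the special case N-len(L / 2), where L is the total length."""
    return n_len(scaffold_lengths, sum(scaffold_lengths) / 2)


def n_len_auc(scaffold_lengths, step=1_000_000):
    """Assumed approximation of the AUC: the sum of N-len(x) sampled at
    x = step, 2 * step, ... up to the total length L."""
    total = sum(scaffold_lengths)
    return sum(n_len(scaffold_lengths, x) for x in range(step, total + 1, step))


# Toy scaffold lengths (made up for illustration), total length L = 115.
toy = [20, 50, 5, 30, 10]
print(n_len(toy, 60))          # 30: 50 < 60, but 50 + 30 = 80 >= 60
print(n50(toy))                # 30: x = 57.5
print(n_len_auc(toy, step=25)) # 150: 50 + 50 + 30 + 20 at x = 25, 50, 75, 100
```

Because N-len(x) is evaluated at an absolute cumulative length x rather than at a fraction of each assembly's own total, two assemblies with very different total lengths can be compared at the same x, which is the motivation given above for preferring it over the raw N50.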