MBBC: an efficient approach for metagenomic binning based on clustering.

Wang Y, Hu H, Li X - BMC Bioinformatics (2015)

Bottom Line: We have developed a novel method for binning metagenomic reads based on clustering. This method is demonstrated to reliably predict species numbers, genome sizes, relative species abundances, and k-mer coverage in simple datasets. Our method also has a high accuracy in read binning.


Affiliation: Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, 32816, USA. ying2010@knights.ucf.edu.

ABSTRACT

Background: Binning environmental shotgun reads is one of the most fundamental tasks in metagenomic studies, in which mixed reads from different species or operational taxonomic units (OTUs) are separated into different groups. While dozens of binning methods are available, there is still room for improvement.

Results: We developed a novel taxonomy-independent approach called MBBC (Metagenomic Binning Based on Clustering) to cluster environmental shotgun reads, by considering k-mer frequency in reads and Markov properties of the inferred OTUs. Tested on twelve simulated datasets, MBBC reliably estimated the species number, the genome size, and the relative abundance of each species, independent of whether there are errors in reads. Tested on multiple experimental datasets, MBBC outperformed two state-of-the-art taxonomy-independent methods, in terms of the accuracy of the estimated species number, genome sizes, and percentages of correctly assigned reads, among other metrics.

Conclusions: We have developed a novel method for binning metagenomic reads based on clustering. This method is demonstrated to reliably predict species numbers, genome sizes, relative species abundances, and k-mer coverage in simple datasets. Our method also has a high accuracy in read binning. The MBBC software is freely available at http://eecs.ucf.edu/~xiaoman/MBBC/MBBC.html .


Figure 2: The procedure of read clustering in MBBC. The main steps are shown on the left, and each is connected to the output it produces on the right.

Mentions: We developed a novel method called MBBC (Figure 2). Our method starts with an expectation-maximization (EM) algorithm that groups k-mers based on their frequencies in reads. The assumption behind the k-mer grouping is that the frequency of k-mers in reads follows a mixture of Poisson distributions [33]. Next, MBBC iteratively estimates the number of k-mers that occur 0 to 3 times in reads and reruns the EM algorithm to estimate the parameters of the Poisson mixture. The rationale behind the iterative estimation is that these numbers are either unobserved or inaccurate and thus affect the estimation of the other parameters [14,15]. MBBC then determines the species number and initially groups reads based on the Poisson parameters. After that, MBBC iteratively models the Markov property of the reads in each group and reassigns reads to groups. Finally, MBBC determines the genome sizes and other metrics based on the assigned reads and the estimated parameters. The details are presented in the following sections of the article.
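The first step above, fitting a Poisson mixture to k-mer frequencies with EM, can be illustrated with a minimal sketch. This is not the MBBC implementation: the function names (poisson_logpmf, em_poisson_mixture, sample_poisson), the quantile-based initialization, and the stopping rule are assumptions made for illustration. The sketch also assumes the number of components is given and treats every k-mer count as fully observed, whereas MBBC re-estimates the number of k-mers occurring 0 to 3 times and infers the species number itself.

```python
import math
import random
from collections import Counter


def poisson_logpmf(x, lam):
    """Log of the Poisson probability P(X = x) for rate lam > 0."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def em_poisson_mixture(counts, n_components, n_iter=500, tol=1e-9):
    """Fit a Poisson mixture to k-mer occurrence counts with EM.

    Illustrative only (hypothetical helper, not MBBC's code).
    Returns a list of (rate, weight) pairs sorted by rate.
    """
    hist = Counter(counts)  # occurrence count -> number of k-mers with that count
    total = sum(hist.values())
    ordered = sorted(counts)
    # Assumed initialization: spread the rates over quantiles of the data.
    lams = [max(ordered[int((j + 0.5) * total / n_components)], 0.5)
            for j in range(n_components)]
    weights = [1.0 / n_components] * n_components
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: accumulate responsibilities over the histogram.
        sum_resp = [0.0] * n_components
        sum_resp_x = [0.0] * n_components
        ll = 0.0
        for x, mult in hist.items():
            logs = [math.log(w) + poisson_logpmf(x, lam)
                    for w, lam in zip(weights, lams)]
            top = max(logs)
            log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
            ll += mult * log_norm
            for j, v in enumerate(logs):
                r = mult * math.exp(v - log_norm)
                sum_resp[j] += r
                sum_resp_x[j] += r * x
        # M-step: closed-form updates for mixing weights and Poisson rates.
        for j in range(n_components):
            nj = max(sum_resp[j], 1e-12)  # guard against an empty component
            weights[j] = nj / total
            lams[j] = max(sum_resp_x[j] / nj, 1e-6)
        if ll - prev_ll < tol:  # stop once the log-likelihood stalls
            break
        prev_ll = ll
    return sorted(zip(lams, weights))


def sample_poisson(lam, rng):
    """Draw one Poisson sample with Knuth's method (for toy data only)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k
```

On toy data drawn from two Poisson rates, each fitted rate plays the role of the k-mer coverage of one hypothetical species, and each weight approximates the share of k-mers attributed to it. Working on the histogram of counts rather than on individual k-mers keeps each EM iteration proportional to the number of distinct count values.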

