DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences.

Ghosh TS, Monzoorul Haque M, Mande SS - BMC Bioinformatics (2010)

Bottom Line: We demonstrate that incorporating this approach reduces binning time by half without any loss in the specificity and accuracy of assignments. Besides, a novel reclassification strategy incorporated in DiScRIBinATE results in reducing the overall misclassification rate to around 3-7%. This misclassification rate is 1.5-3 times lower as compared to that by SOrt-ITEMS, and 3-30 times lower as compared to that by MEGAN.


Affiliation: Bio-Sciences Division, Innovation Labs, Tata Consultancy Services, 1 Software Units Layout, Hyderabad 500 081, Andhra Pradesh, India. tarini@atc.tcs.com

ABSTRACT

Background: In metagenomic sequence data, the majority of sequences/reads originate from new or partially characterized genomes, the corresponding sequences of which are absent from existing reference databases. Since taxonomic assignment of reads is based on their similarity to sequences from known organisms, the presence of reads originating from new organisms poses a major challenge to taxonomic binning methods. The recently published SOrt-ITEMS algorithm uses an elaborate workflow to assign reads originating from hitherto unknown genomes with significant accuracy and specificity. Nevertheless, a significant proportion of reads are still misclassified. Moreover, the alignment-based orthology step (used to improve the specificity of assignments) increases the total binning time of SOrt-ITEMS.

Results: In this paper, we introduce a rapid binning approach called DiScRIBinATE (Distance Score Ratio for Improved Binning And Taxonomic Estimation). DiScRIBinATE replaces the orthology approach of SOrt-ITEMS with a quicker 'alignment-free' approach. We demonstrate that incorporating this approach reduces binning time by half without any loss in the specificity and accuracy of assignments. In addition, a novel reclassification strategy incorporated in DiScRIBinATE reduces the overall misclassification rate to around 3-7%. This misclassification rate is 1.5-3 times lower than that of SOrt-ITEMS, and 3-30 times lower than that of MEGAN.

Conclusions: A significant reduction in binning time, coupled with a superior assignment accuracy (as compared to existing binning methods), indicates the immense applicability of the proposed algorithm in rapidly mapping the taxonomic diversity of large metagenomic samples with high accuracy and specificity.

Availability: The program is available on request from the authors.


Figure 5: Example illustrating the improved binning specificity obtained using the 'bit-score/distance' approach. In this example, the read originates from Burkholderia ambifaria (the corresponding sequences of which are present in the reference database). Based on the alignment parameters (Identities: I, Positives: P, Gaps) obtained for the best hit, the appropriate taxonomic level of assignment is restricted to the 'Species level'. Identified candidate ancestors are depicted in bold boxes. The read is assigned by DiScRIBinATE at the genus level (i.e. Burkholderia) on account of its higher cumulative 'bit-score/distance' ratio. An LCA-based approach would have assigned the same read at the phylum level (i.e. Proteobacteria).

DiScRIBinATE uses the above principle to improve the specificity of assignments. For each query sequence, DiScRIBinATE identifies a set of candidate ancestors (from among the set of significant hits) and assigns the query to the candidate ancestor having the highest bit-score/distance ratio. This approach ensures that the query sequence is assigned to the taxon that is taxonomically closest to it (thereby improving specificity). The specificity of assignments using this alternative approach is seen to be comparable to that obtained using the orthology-based approach. Figure 5 illustrates an example of how this alternative approach improves the specificity of assignments. It is interesting to note that a similar approach had been used earlier for the identification of horizontal gene transfer in metagenomic samples [9].
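
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the text here does not define how candidate ancestors are chosen or how 'distance' is measured, so the sketch treats every taxon in a hit's lineage as a candidate and takes the distance to be the number of rank steps from the hit's species up to that taxon (species = 1, genus = 2, and so on). The hit list, the bit-scores, and the function names `lca`, `cumulative_ratios` and `assign` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hits for one read: (lineage from phylum down to species, bit-score).
# Lineages and scores are invented for illustration only.
BURK = ("Proteobacteria", "Betaproteobacteria", "Burkholderiales",
        "Burkholderiaceae", "Burkholderia")
HITS = [
    (BURK + ("Burkholderia ambifaria",), 100.0),
    (BURK + ("Burkholderia cenocepacia",), 98.0),
    (BURK + ("Burkholderia multivorans",), 96.0),
    (("Proteobacteria", "Betaproteobacteria", "Burkholderiales",
      "Burkholderiaceae", "Ralstonia", "Ralstonia solanacearum"), 60.0),
    (("Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
      "Enterobacteriaceae", "Escherichia", "Escherichia coli"), 40.0),
]


def lca(hits):
    """Lowest common ancestor: the deepest taxon shared by every hit lineage."""
    common = None
    for taxa in zip(*(lineage for lineage, _ in hits)):
        if len(set(taxa)) > 1:
            break
        common = taxa[0]
    return common


def cumulative_ratios(hits):
    """Sum bit-score/distance over all hits that descend from each taxon.

    Assumption: distance = rank steps from the hit's species up to the
    taxon, counted from 1 so that the ratio is always defined.
    """
    totals = defaultdict(float)
    for lineage, bit_score in hits:
        depth = len(lineage)
        for level, taxon in enumerate(lineage):
            distance = depth - level  # species -> 1, genus -> 2, ...
            totals[taxon] += bit_score / distance
    return dict(totals)


def assign(hits):
    """Pick the candidate with the highest cumulative bit-score/distance ratio."""
    ratios = cumulative_ratios(hits)
    return max(ratios, key=ratios.get)


if __name__ == "__main__":
    print("LCA assignment:  ", lca(HITS))
    print("Ratio assignment:", assign(HITS))
```

With these made-up hits, a single distant hit pulls the lowest common ancestor up to the phylum, while the cumulative ratio favours the genus that collects several strong, nearby hits. This mirrors the qualitative behaviour described in the Figure 5 caption, although the actual weighting used by DiScRIBinATE may differ.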

