Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms.

Lan Y, Wang Q, Cole JR, Rosen GL - PLoS ONE (2012)

Bottom Line: We conclude that selecting a read-length appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.


Affiliation: School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America.

ABSTRACT

Background: Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa.

Principal findings: Because each fragment is assigned a score (containing likelihood or confidence information such as the bootstrap score in the RDP classifier), we "train" a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than the other methods tested, measured by the f-measure and the area-under-the-curve on the receiver operating characteristic of the test set. By constraining the database to well-represented genera, sensitivity improves 3-15%. Finally, we show that the detector is a good predictor to determine novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present).
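
The threshold "training" described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: a read is called novel when its score falls below the threshold, "novel" is treated as the positive class, candidate thresholds are the observed scores themselves, and the f-measure is the usual harmonic mean of precision and sensitivity. The function names `confusion_metrics` and `train_threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple


def confusion_metrics(
    scores: Sequence[float], is_novel: Sequence[bool], threshold: float
) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, f_measure) for one threshold.

    Assumption: a read is predicted novel when its score is strictly
    below the threshold, and "novel" is the positive class.
    """
    tp = fp = tn = fn = 0
    for score, novel in zip(scores, is_novel):
        predicted_novel = score < threshold
        if predicted_novel and novel:
            tp += 1
        elif predicted_novel and not novel:
            fp += 1
        elif novel:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumed definition: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, specificity, f_measure


def train_threshold(scores: Sequence[float], is_novel: Sequence[bool]) -> float:
    """Pick the observed score that maximizes the f-measure on training reads."""
    candidates: List[float] = sorted(set(scores))
    best_threshold, best_f = candidates[0], -1.0
    for candidate in candidates:
        f_measure = confusion_metrics(scores, is_novel, candidate)[2]
        if f_measure > best_f:  # ties keep the lowest threshold
            best_threshold, best_f = candidate, f_measure
    return best_threshold


# Toy scores in [0, 1]; these values are illustrative, not data from the paper.
train_scores = [0.95, 0.90, 0.85, 0.40, 0.30, 0.60]
train_labels = [False, False, False, True, True, True]
threshold = train_threshold(train_scores, train_labels)
sensitivity, specificity, f_measure = confusion_metrics(
    train_scores, train_labels, threshold
)
```

In this sketch, raising the threshold flags more reads as novel, which trades specificity for sensitivity; a conservative detector in the sense used above corresponds to a low threshold.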

Conclusions: We conclude that selecting a read-length appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.

Fig. 7 (pone-0032491-g007): The sensitivity, specificity, and f-measure comparison of novel/known detection of the 500 bp read test dataset on the genus level. RDP's bootstrap performs the best for being able to discriminate between reads from known and novel origin, with around 76% for the combined f-measure.

Mentions: Using the thresholds derived from the process in Fig. 3, we evaluated the process on NBC/RDP's likelihood scores and Phymm/PhymmBL's raw scores. We also evaluated how well the chosen PhymmBL confidence scores and the PhylOTU method performed on the test set; for the latter, we used the training set, but did not go through the 5-fold training shown in Fig. 3. The results of the methods on the test dataset are shown in Fig. 7 for 500 bp reads (with 250 bp and 100 bp reads in Figs. 8 and 9 respectively).
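
The 5-fold procedure of Fig. 3 is not reproduced in this excerpt, so the following Python sketch shows only one plausible reading of it: split the training reads into five folds, pick an f-measure-maximizing threshold on each set of four folds, and average the five thresholds. The averaging step, the rule that a read is novel when its score is below the threshold, and the names `f_measure`, `best_threshold`, and `k_fold_threshold` are all assumptions made for illustration.

```python
import random
from statistics import mean
from typing import List, Sequence


def f_measure(scores: Sequence[float], is_novel: Sequence[bool], t: float) -> float:
    """F-measure with 'novel' as the positive class (novel if score < t)."""
    tp = sum(1 for s, n in zip(scores, is_novel) if s < t and n)
    fp = sum(1 for s, n in zip(scores, is_novel) if s < t and not n)
    fn = sum(1 for s, n in zip(scores, is_novel) if s >= t and n)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores: Sequence[float], is_novel: Sequence[bool]) -> float:
    """Lowest observed score that maximizes the f-measure."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f_measure(scores, is_novel, t), -t))


def k_fold_threshold(
    scores: Sequence[float], is_novel: Sequence[bool], k: int = 5, seed: int = 0
) -> float:
    """Average the thresholds trained on each set of k-1 folds (assumed scheme)."""
    indices = list(range(len(scores)))
    random.Random(seed).shuffle(indices)
    folds: List[List[int]] = [indices[i::k] for i in range(k)]
    thresholds = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        thresholds.append(
            best_threshold([scores[i] for i in train], [is_novel[i] for i in train])
        )
    return mean(thresholds)


# Toy, well-separated scores; illustrative values only, not data from the paper.
known_scores = [0.80 + 0.01 * i for i in range(10)]
novel_scores = [0.10 + 0.01 * i for i in range(10)]
all_scores = known_scores + novel_scores
all_labels = [False] * 10 + [True] * 10
trained = k_fold_threshold(all_scores, all_labels)
```

The trained threshold would then be applied once to the held-out test reads to obtain the sensitivity, specificity, and f-measure of the kind reported for the 500 bp, 250 bp, and 100 bp reads.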

