Phylogenetic approaches to microbial community classification.

Ning J, Beiko RG - Microbiome (2015)

Bottom Line: Custom kernels based on the UniFrac measure of community dissimilarity did not improve performance. However, feature representation was vital to classification accuracy, with microbial clade and function representations providing useful information to the classifier; combining the two types of features did not yield increased prediction accuracy. However, further exploration of the space of both classifiers and feature representations is likely to increase the accuracy of predictive models.

Affiliation: Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, B3H 4R2, Canada. jessie.jning@gmail.com.

ABSTRACT

Background: The microbiota from different body sites are dominated by different major groups of microbes, but the variations within a body site such as the mouth can be more subtle. Accurate predictive models can serve as useful tools for distinguishing sub-sites and understanding key organisms and their roles and can highlight deviations from expected distributions of microbes. Good classification depends on choosing the right combination of classifier, feature representation, and learning model. Machine-learning procedures have been used in the past for supervised classification, but increased attention to feature representation and selection may produce better models and predictions.

Results: We focused our attention on the classification of nine oral sites and dental plaque in particular, using data collected from the Human Microbiome Project. A key focus of our representations was the use of phylogenetic information, both as the basis for custom kernels and as a way to represent sets of microbes to the classifier. We also used the PICRUSt software, which draws on phylogenetic relationships to predict molecular functions and to generate additional features for the classifier. Custom kernels based on the UniFrac measure of community dissimilarity did not improve performance. However, feature representation was vital to classification accuracy, with microbial clade and function representations providing useful information to the classifier; combining the two types of features did not yield increased prediction accuracy. Many of the best-performing clades and functions had clear associations with oral microflora.

Conclusions: The classification of oral microbiota remains a challenging problem; our best accuracy on the plaque dataset was approximately 81 %. Perfect accuracy may be unattainable due to the close proximity of the sites and intra-individual variation. However, further exploration of the space of both classifiers and feature representations is likely to increase the accuracy of predictive models.


Fig. 4: Classification accuracy with different sets of input features. The classification accuracy is shown for sets of 10 to 200 of the top-ranked features according to the information gain (a), chi-square (b), and RF feature permutation (c) criteria. The four types of input features used were OTUs only (orange markers), OTUs and clades (green markers), functional predictions made using PICRUSt (purple markers), and all generated features (blue markers).

Mentions: Since the beta-diversity custom kernels did not improve the classification accuracy, we used the generic RBF kernel in all subsequent trials. We next augmented the OTU table with relative abundance information about clades that contain multiple OTUs, to determine whether explicit specification of relationships among OTUs might lead to better prediction accuracy. Fifty-two OTUs were lost because their corresponding sequences failed the PyNAST quality control filters, leaving a total of 6996 OTUs. To this set, we added 6994 clades, corresponding to all internal nodes in the reference tree, minus the uninformative root node, which always has a relative abundance of 1.0. The classification accuracy obtained without feature selection was lower than that obtained from the OTU table without clade information (73.8 vs. 76.2 %). While the OTU + clade table has almost twice as many features as the OTU abundance table alone and includes over 99 % of the original OTUs (Additional file 6), it appears that the higher dimensionality of the data confounds the SVM classifier, making it more difficult to build an accurate model. However, applying feature selection as above gave at least 80 % accuracy (Table 2). As was observed previously with the OTU table, the filter methods required more features to achieve their maximum classification accuracy (110 and 170 for information gain and chi-square, respectively, versus 100 features for the RF approach). Figure 4 shows the performance of classifiers with different numbers of features. Information gain (Fig. 4a) and chi-square (Fig. 4b) had similar performance: the accuracy of OTU abundance varied between 76 and 78 % with different numbers of features. However, clade abundance gave accuracy scores that were often in excess of 80 %. Both OTU and clade abundance can classify samples well with a small number of RF-ranked features (Fig. 4c), but as the number of features increased, the performance of OTU abundance worsened whereas clade abundance continued to perform well. It appears that explicitly modeling the phylogenetic correlations between OTUs allows the filter methods to exploit the interactions that were previously accessible only to the wrapper method.
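
The clade augmentation described above can be illustrated with a minimal sketch. It assumes a reference tree given as a nested Python list whose leaves are OTU identifiers, and it sums OTU relative abundances at every internal node except the root. The tree encoding, the function name `clade_abundances`, and the clade naming scheme are assumptions made for this illustration and are not taken from the article. The reported counts (6996 OTUs, 6994 clades) are consistent with a fully bifurcating tree, which has one fewer internal node than leaves, once the root is excluded.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical tree encoding: a nested list whose leaves are OTU ids (strings).
Tree = Union[str, List["Tree"]]


def clade_abundances(tree: Tree, otu_abundance: Dict[str, float]) -> Dict[str, float]:
    """Return the relative abundance of every internal node except the root.

    Each clade is named by the sorted OTU ids it contains, joined with '|'
    (an illustrative naming scheme, not one used in the article).
    """
    clades: Dict[str, float] = {}

    def visit(node: Tree, is_root: bool) -> Tuple[List[str], float]:
        # Leaf: abundance comes straight from the OTU table (0.0 if absent).
        if isinstance(node, str):
            return [node], otu_abundance.get(node, 0.0)
        leaves: List[str] = []
        total = 0.0
        for child in node:
            child_leaves, child_total = visit(child, False)
            leaves.extend(child_leaves)
            total += child_total
        # The root always sums to 1.0 for relative abundances, so skip it.
        if not is_root:
            clades["|".join(sorted(leaves))] = total
        return leaves, total

    visit(tree, True)
    return clades


# Toy data: five OTUs in the tree; otu5 is absent from this sample.
toy_tree = [["otu1", "otu2"], ["otu3", ["otu4", "otu5"]]]
toy_sample = {"otu1": 0.5, "otu2": 0.25, "otu3": 0.125, "otu4": 0.125}

toy_clades = clade_abundances(toy_tree, toy_sample)
# OTU features plus clade features form the augmented feature vector.
augmented = {**toy_sample, **toy_clades}
```

In this toy case the five-leaf tree yields three clade features, so the augmented vector nearly doubles in length, mirroring the dimensionality increase noted in the text. Feature selection would then be applied to the augmented table before training the classifier.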

