Phylogenetic approaches to microbial community classification.

Ning J, Beiko RG - Microbiome (2015)

Bottom Line: Custom kernels based on the UniFrac measure of community dissimilarity did not improve performance. However, feature representation was vital to classification accuracy, with microbial clade and function representations providing useful information to the classifier; combining the two types of features did not yield increased prediction accuracy. However, further exploration of the space of both classifiers and feature representations is likely to increase the accuracy of predictive models.


Affiliation: Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, B3H 4R2, Canada. jessie.jning@gmail.com.

ABSTRACT

Background: The microbiota from different body sites are dominated by different major groups of microbes, but the variations within a body site such as the mouth can be more subtle. Accurate predictive models can serve as useful tools for distinguishing sub-sites and understanding key organisms and their roles and can highlight deviations from expected distributions of microbes. Good classification depends on choosing the right combination of classifier, feature representation, and learning model. Machine-learning procedures have been used in the past for supervised classification, but increased attention to feature representation and selection may produce better models and predictions.

Results: We focused our attention on the classification of nine oral sites and dental plaque in particular, using data collected from the Human Microbiome Project. A key focus of our representations was the use of phylogenetic information, both as the basis for custom kernels and as a way to represent sets of microbes to the classifier. We also used the PICRUSt software, which draws on phylogenetic relationships to predict molecular functions and to generate additional features for the classifier. Custom kernels based on the UniFrac measure of community dissimilarity did not improve performance. However, feature representation was vital to classification accuracy, with microbial clade and function representations providing useful information to the classifier; combining the two types of features did not yield increased prediction accuracy. Many of the best-performing clades and functions had clear associations with oral microflora.

Conclusions: The classification of oral microbiota remains a challenging problem; our best accuracy on the plaque dataset was approximately 81 %. Perfect accuracy may be unattainable due to the close proximity of the sites and intra-individual variation. However, further exploration of the space of both classifiers and feature representations is likely to increase the accuracy of predictive models.



Fig. 6: Classification of plaque samples using the 1–10 top-ranked features. Heatmap of the number of times each sample was misclassified during 100 replicated runs. The rows correspond to the number of features used and the corresponding accuracy. Each column represents one sample: a yellow bar indicates a sample that was misclassified in all 100 replicates, while a black bar indicates a sample that was correctly predicted in all replicates. Only samples that were misclassified at least once are shown.
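The matrix behind a heatmap of this kind can be tallied in a few lines. The following is a minimal sketch, not the authors' code: the function names, the class labels "A" and "B", and the toy predictions (two replicates rather than 100) are all hypothetical and serve only to illustrate the counting and filtering the caption describes.

```python
# Hypothetical sketch: count per-sample misclassifications across replicates
# for each feature-set size, then keep only samples missed at least once.

def misclassification_counts(true_labels, replicate_predictions):
    """Return, for each sample, the number of replicates that mislabeled it."""
    counts = [0] * len(true_labels)
    for predictions in replicate_predictions:
        for i, (truth, guess) in enumerate(zip(true_labels, predictions)):
            if guess != truth:
                counts[i] += 1
    return counts


def heatmap_rows(true_labels, predictions_by_size):
    """Build one row of counts per feature-set size.

    predictions_by_size maps a number of features to a list of replicate
    prediction lists. Returns the indices of the samples that were
    misclassified at least once, and the rows restricted to those samples.
    """
    rows = {
        n: misclassification_counts(true_labels, reps)
        for n, reps in sorted(predictions_by_size.items())
    }
    kept = [
        i for i in range(len(true_labels))
        if any(row[i] > 0 for row in rows.values())
    ]
    return kept, {n: [row[i] for i in kept] for n, row in rows.items()}


# Toy data (assumed): four samples, two feature-set sizes, two replicates each.
true_labels = ["A", "B", "A", "B"]
predictions_by_size = {
    1: [["A", "B", "A", "A"], ["A", "B", "A", "B"]],
    2: [["B", "B", "A", "A"], ["B", "B", "A", "A"]],
}
kept, rows = heatmap_rows(true_labels, predictions_by_size)
print(kept)
print(rows)
```

In this toy run, a count equal to the number of replicates plays the role of a yellow bar and a count of zero the role of a black bar; samples 1 and 2 are never misclassified and are therefore dropped from the output.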

Mentions: Since both function and taxonomy yield similar classification accuracy, combinations of the two types of features may further improve accuracy if they contain complementary information. To assess the performance of classifiers based on combined clade and functional information, we performed feature selection on a hybrid data set containing features of both types. The results of feature selection and classification are shown in Table 2. The accuracy obtained from all three types of feature selection was 80.4–80.5 %, and the RF feature permutation approach yielded a maximum accuracy score with only 28 clade-based and 22 functional features. The small improvement in accuracy of the hybrid approach relative to clade-based classification alone (Table 2, Fig. 4) suggests that the functional features do not provide much useful complementary information to taxonomy: the increase of 0.3 % relative to previous wrapper-based results corresponds to only a few additional correctly classified cases. However, the small number of selected clade-based and functional features that yield these accuracy scores allows us to focus on the key attributes that distinguish the two types of plaque. To focus on a smaller set of features, we defined groups of features with high correlations (Spearman r ≥ 0.5) and chose only the highest-ranking or primary feature from each group, thus preserving relevance while reducing feature redundancy. SVMs built from the top ten features showed slightly reduced accuracy in comparison with the overall performance of clade and hybrid features but demonstrated the central importance of a very small number of features (Fig. 6). Intriguingly, a number of samples were correctly classified when the single best feature (a clade of Streptococcus) was used but incorrectly classified when one or more additional features were included.
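The redundancy-reduction step (grouping features with Spearman r ≥ 0.5 and keeping the highest-ranking feature of each group) could be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: the text does not say how groups were formed, so the sketch assumes a greedy pass in rank order, compares each feature only with the primary features already chosen, and applies the threshold to the signed correlation as written. The feature names and abundance values are invented.

```python
# Hypothetical sketch of correlation-based grouping with a greedy pass.

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def primary_features(ranked_names, columns, threshold=0.5):
    """Greedily group features; ranked_names must be ordered best first.

    A feature joins the first existing group whose primary feature it
    correlates with at or above the threshold; otherwise it starts a new
    group and becomes that group's primary feature.
    """
    groups = {}
    for name in ranked_names:
        for primary in groups:
            if spearman(columns[name], columns[primary]) >= threshold:
                groups[primary].append(name)
                break
        else:
            groups[name] = [name]
    return groups


# Toy per-sample values (assumed) and an assumed importance ranking.
columns = {
    "clade_A": [1, 2, 3, 4, 5, 6],
    "clade_B": [2, 4, 5, 8, 9, 12],   # rises with clade_A in every sample
    "func_X": [6, 1, 4, 2, 5, 3],
    "func_Y": [3, 3, 3, 3, 3, 3],     # constant across samples
}
ranked = ["clade_A", "func_X", "clade_B", "func_Y"]
groups = primary_features(ranked, columns)
print(list(groups))  # primary features, best first
```

Here clade_B is absorbed into the group led by clade_A, so only clade_A, func_X, and func_Y would go forward as primary features. Other grouping schemes, such as clustering on the full correlation matrix, can give different groups for the same threshold.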

