Limits...
Comparing K-mer based methods for improved classification of 16S sequences.

Vinje H, Liland KH, Almøy T, Snipen L - BMC Bioinformatics (2015)

Bottom Line: The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method.On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods.We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads).

View Article: PubMed Central - PubMed

Affiliation: Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Oslo, N-1432 Ås, Norway. hilde.vinje@nmbu.no.

ABSTRACT

Background: The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison on five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read-length.

Results: The difference in classification error obtained by the methods seemed to be small, but they were stable and for both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all the five methods reached an error-plateau.

Conclusions: We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau indicating improved training data is needed to improve classification from here. Classification errors occur most frequent for genera with few sequences present. For improving the taxonomy and testing new classification methods, the need for a better and more universal and robust training data set is crucial.

Show MeSH
Error distribution over class-sizes. The horizontal axis is the class-size (number of sequences in a genus) and the vertical axis is the error percentage averaged over all genera of the same size. The upper panel is the result from Trainingset9 and the lower panel for the SilvaSet. Only class-sizes up to 10 is shown, for larger classes the error-percentages are small
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4487979&req=5

Fig6: Error distribution over class-sizes. The horizontal axis is the class-size (number of sequences in a genus) and the vertical axis is the error percentage averaged over all genera of the same size. The upper panel is the result from Trainingset9 and the lower panel for the SilvaSet. Only class-sizes up to 10 is shown, for larger classes the error-percentages are small

Mentions: To investigate further the effect of genus-size, we have in Fig. 6 plotted the error percentages for sizes two to ten. As expected, small genera had elevated risk of being classified wrongly. Genera of size two means there are two sequences in total, one sequence in the training set and one in the test set. Recognizing a genus based on one previously observed sequence is of course very difficult. The genera with only one sequence present (singletons) are not shown as they always will have 100 % error. The figure shows that more errors were made for genera consisting of few sequences and this skewness in abundance poses a challenge to all statistical learning methods. One may argue that to improve classifications we need better data more than we need better methods, and that a larger data set is not necessarily a better data set. The SilvaSet is three times larger than Trainingset9 but still relatively more errors were made. We agree that better data is essential, but better data and better methods are also interleaved, since no data set is completely independent of methods, and manual curation is certainly no guarantee against classification errors.Fig. 6


Comparing K-mer based methods for improved classification of 16S sequences.

Vinje H, Liland KH, Almøy T, Snipen L - BMC Bioinformatics (2015)

Error distribution over class-sizes. The horizontal axis is the class-size (number of sequences in a genus) and the vertical axis is the error percentage averaged over all genera of the same size. The upper panel is the result from Trainingset9 and the lower panel for the SilvaSet. Only class-sizes up to 10 is shown, for larger classes the error-percentages are small
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4487979&req=5

Fig6: Error distribution over class-sizes. The horizontal axis is the class-size (number of sequences in a genus) and the vertical axis is the error percentage averaged over all genera of the same size. The upper panel is the result from Trainingset9 and the lower panel for the SilvaSet. Only class-sizes up to 10 is shown, for larger classes the error-percentages are small
Mentions: To investigate further the effect of genus-size, we have in Fig. 6 plotted the error percentages for sizes two to ten. As expected, small genera had elevated risk of being classified wrongly. Genera of size two means there are two sequences in total, one sequence in the training set and one in the test set. Recognizing a genus based on one previously observed sequence is of course very difficult. The genera with only one sequence present (singletons) are not shown as they always will have 100 % error. The figure shows that more errors were made for genera consisting of few sequences and this skewness in abundance poses a challenge to all statistical learning methods. One may argue that to improve classifications we need better data more than we need better methods, and that a larger data set is not necessarily a better data set. The SilvaSet is three times larger than Trainingset9 but still relatively more errors were made. We agree that better data is essential, but better data and better methods are also interleaved, since no data set is completely independent of methods, and manual curation is certainly no guarantee against classification errors.Fig. 6

Bottom Line: The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method.On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods.We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads).

View Article: PubMed Central - PubMed

Affiliation: Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Oslo, N-1432 Ås, Norway. hilde.vinje@nmbu.no.

ABSTRACT

Background: The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison on five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read-length.

Results: The difference in classification error obtained by the methods seemed to be small, but they were stable and for both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all the five methods reached an error-plateau.

Conclusions: We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau indicating improved training data is needed to improve classification from here. Classification errors occur most frequent for genera with few sequences present. For improving the taxonomy and testing new classification methods, the need for a better and more universal and robust training data set is crucial.

Show MeSH