A supervised learning approach for taxonomic classification of core-photosystem-II genes and transcripts in the marine environment.

Tzahor S, Man-Aharonovich D, Kirkup BC, Yogev T, Berman-Frank I, Polz MF, Béjà O, Mandel-Gutfreund Y - BMC Genomics (2009)

Bottom Line: Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology.


Affiliation: Faculty of Biology, Technion - Israel Institute of Technology, Haifa, Israel. shani_tz@yahoo.com

ABSTRACT

Background: Cyanobacteria of the genera Synechococcus and Prochlorococcus play a key role in marine photosynthesis, which contributes to the global carbon cycle and to the world oxygen supply. Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. This phenomenon suggested that the horizontal transfer of these genes may be involved in increasing phage fitness. To date, a very small percentage of marine bacteria and phages has been cultured. Thus, mapping genomic data extracted directly from the environment to its taxonomic origin is necessary for a better understanding of phage-host relationships and dynamics.

Results: To achieve an accurate and rapid taxonomic classification, we employed a computational approach combining a multi-class Support Vector Machine (SVM) with a codon usage position-specific scoring matrix (cuPSSM). Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. Applying the method to a large set of DNA and RNA psbA clones from the Mediterranean Sea, we studied the distribution of cyanobacterial psbA genes and transcripts in their natural environment. Using our approach, we were able to simultaneously examine taxonomic and ecological distributions in the marine environment.

Conclusion: The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology. The classification method presented in this paper could be applied further to classify other genes amplified from the environment, for which training data is available.

Figure 3: A summary of the psbA sequence classifier. Each sequence is classified by two independent approaches: multi-class SVM (left) and cuPSSM (right). In the multi-class SVM classifier, each sequence is represented by an oligonucleotide frequency vector (calculated in overlapping windows) and tested against seven different SVM classifiers (trained on culture and environmental data from GOS) [See Additional Files 1 and 2]. The sequence is assigned to the class whose classifier gave it the highest positive value. Independently, the sequence is aligned to a template psbA gene and scored against seven different cuPSSMs. The sequence is then classified based on the subgroup for which it achieved the highest score. Finally, the results of the two approaches are compared. Sequences for which the two independent classifiers converged are classified according to the common sub-classification. In cases where no agreement exists, the sequence is further classified manually as described in the text.
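The SVM branch of the figure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the oligonucleotide length `k`, the function names, and the class names in the usage note are assumptions made for illustration, and the per-class decision values are assumed to come from seven already-trained one-vs-rest SVMs.

```python
from collections import Counter
from itertools import product


def oligo_frequencies(seq, k=4):
    """Relative frequency of every k-mer, counted in overlapping windows.

    k=4 is an illustrative assumption; the excerpt does not state the
    oligonucleotide length that was used.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k one base at a time (overlapping windows).
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are left out of the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def one_vs_rest_label(decision_values):
    """Return the class with the highest positive decision value.

    decision_values maps a class name to the value produced by that
    class's one-vs-rest classifier. If no value is positive, the
    function returns None.
    """
    label, value = max(decision_values.items(), key=lambda kv: kv[1])
    return label if value > 0 else None
```

For example, `one_vs_rest_label({"class_1": -0.4, "class_2": 0.7})` selects `"class_2"` (hypothetical class names), while a fragment with no positive value gets no SVM label.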

Mentions: Although the PCA clustering clearly indicates that psbA fragments can be grouped based on their oligonucleotide frequencies, it is not a practical method for the automatic and accurate classification of large amounts of data. Thus, we expanded the binary prediction described above to a multi-class problem. To this end, we built seven new SVM classifiers, one for each taxonomic group. In each classifier, we trained an SVM to separate one of the seven groups from all the other groups, using the oligonucleotide frequencies as a feature vector. Subsequently, each fragment was tested on each of the classifiers and was given a discriminant value, denoting the confidence that the fragment belongs to the specific group. The discriminant values produced by the classifiers were then ranked, and the tested fragment was labeled based on the classifier in which it received the highest positive value. Independently, we built seven cuPSSMs based on the codon usage of the training data. The tested sequences were then aligned to a template psbA gene, scored against the seven cuPSSMs (each cuPSSM was trained on a single taxonomic group) and given a label based on the cuPSSM for which they achieved the highest score. To ensure an accurate prediction of the testing data, each fragment was examined against the two classifiers independently, and a final label was assigned only when the results of the two classifiers converged. A summary of our classification methodology is illustrated in Figure 3. As illustrated, a given sequence is tested independently by two classifiers: (i) multi-class SVM and (ii) cuPSSM. The results of the two approaches are then compared. Sequences for which the two independent classifiers converged are classified according to the common sub-classification (e.g., Prochlorococcus-like myovirus). In cases where there is no agreement between the two classifiers (i.e., multi-class SVM vs. cuPSSM), sequences are considered "unpredicted" and are subjected to a manual decision.
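The cuPSSM branch and the final agreement check can be sketched in the same way. This is a simplified illustration, not the published scoring scheme: it assumes training sequences that are already aligned to the template, in frame, ungapped and of equal length, and it scores a fragment by summing per-position log-probabilities of its codons with a pseudocount. The function names, the pseudocount, and the group names used in the test data are hypothetical.

```python
import math
from collections import Counter


def codons(seq):
    """Split an in-frame sequence into consecutive codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def build_cupssm(aligned_seqs, pseudocount=1.0):
    """Build one column per codon position from one group's training set.

    Each column is a pair: (codon -> log-probability, log-probability
    assigned to a codon never seen at that position).
    """
    columns = zip(*(codons(s) for s in aligned_seqs))
    matrix = []
    for col in columns:
        counts = Counter(col)
        denom = len(col) + 64 * pseudocount  # 64 possible codons
        seen = {c: math.log((n + pseudocount) / denom)
                for c, n in counts.items()}
        matrix.append((seen, math.log(pseudocount / denom)))
    return matrix


def cupssm_score(seq, matrix):
    """Sum the position-specific log-probabilities of the fragment's codons."""
    return sum(seen.get(c, unseen)
               for c, (seen, unseen) in zip(codons(seq), matrix))


def consensus_label(svm_label, seq, matrices):
    """Keep a label only when the SVM and the best-scoring cuPSSM agree."""
    pssm_label = max(matrices, key=lambda g: cupssm_score(seq, matrices[g]))
    return svm_label if svm_label == pssm_label else "unpredicted"
```

Here `matrices` maps each taxonomic group to its cuPSSM, and `svm_label` is the label from the SVM branch. Any disagreement, including the case where the SVM branch produced no label, yields `"unpredicted"`, which corresponds to the sequences that are passed on for a manual decision.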

