A supervised learning approach for taxonomic classification of core-photosystem-II genes and transcripts in the marine environment.

Tzahor S, Man-Aharonovich D, Kirkup BC, Yogev T, Berman-Frank I, Polz MF, Béjà O, Mandel-Gutfreund Y - BMC Genomics (2009)

Bottom Line: Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology.


Affiliation: Faculty of Biology, Technion - Israel Institute of Technology, Haifa, Israel. shani_tz@yahoo.com

ABSTRACT

Background: Cyanobacteria of the genera Synechococcus and Prochlorococcus play a key role in marine photosynthesis, which contributes to the global carbon cycle and to the world oxygen supply. Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. This phenomenon suggested that the horizontal transfer of these genes may be involved in increasing phage fitness. To date, a very small percentage of marine bacteria and phages has been cultured. Thus, mapping genomic data extracted directly from the environment to its taxonomic origin is necessary for a better understanding of phage-host relationships and dynamics.

Results: To achieve an accurate and rapid taxonomic classification, we employed a computational approach combining a multi-class Support Vector Machine (SVM) with a codon usage position specific scoring matrix (cuPSSM). Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. Applying the method on a large set of DNA and RNA psbA clones from the Mediterranean Sea, we studied the distribution of cyanobacterial psbA genes and transcripts in their natural environment. Using our approach, we were able to simultaneously examine taxonomic and ecological distributions in the marine environment.

Conclusion: The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology. The classification method presented in this paper could be applied further to classify other genes amplified from the environment, for which training data is available.

Figure 1: ROC plot demonstrating the SVM performance for classifying host vs. phage psbA trusted sequences from GOS [11] based on genomic composition: mononucleotides (gray line), dinucleotides (green line), trinucleotides (blue line), tetranucleotides (red line), all oligonucleotides (black line). AUC (Area Under Curve) values are given in Table 1. As illustrated, best performance was achieved when including only the tetranucleotide composition.
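The AUC values summarized in the caption can be computed directly from classifier scores. The following is a minimal sketch, not the authors' code: it uses the standard rank-based (Mann-Whitney) definition of AUC, and the function name `roc_auc`, the label convention (1 for phage, 0 for host) and the example scores are all assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for the positive class (assumed here: phage), 0 for the
            negative class (assumed here: host).
    scores: classifier decision values; higher means "more positive".
    Tied positive/negative pairs count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of positive/negative pairs that are ranked correctly.
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up decision values, for illustration only.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
    print(roc_auc(labels, scores))
```

An AUC of 1.0 means every phage sequence scores above every host sequence; 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of sequences, which is acceptable for a test set of a few hundred fragments.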

Mentions: Our SVM training set was derived from 11 and 15 cultured Prochlorococcus host and phage psbA sequences, respectively. In addition, 28 and 16 Synechococcus host and phage sequences were extracted from cultured data [See Additional file 1]. As a first step, we tested the method on an independent set of 280 psbA fragments from the GOS dataset [11]. Applying the SVM classifier based on nucleotide composition, each fragment in the testing set was labeled as Prochlorococcus bacteria, Prochlorococcus-like phage, Synechococcus bacteria or Synechococcus-like phage. To evaluate the accuracy (sensitivity and specificity) of the SVM, we compared our results to the original classification derived from phylogenetic analysis, based on DNA trees [15], and neighboring genes [12] [See Additional file 2]. Overall, when combining the frequency of mono-, di-, tri- and tetranucleotides in our feature vector, we correctly classified 93% and 100% of the Prochlorococcus and Synechococcus sequences, respectively. Detailed results for Prochlorococcus classifications are given in Table 1 and illustrated in Figure 1.
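To make the nucleotide-composition feature vector concrete, the sketch below builds the combined mono-, di-, tri- and tetranucleotide frequency vector for a single sequence. It is an illustration under stated assumptions rather than the published implementation: the text above does not say how frequencies were normalized or whether windows overlap, so this version assumes overlapping windows, per-k normalization, and skipping of windows that contain ambiguous bases. The names `kmer_frequencies` and `composition_vector` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA word of length k in seq.

    Assumptions (not taken from the paper): overlapping windows,
    frequencies normalized to sum to 1 for each k, and windows
    containing characters outside A/C/G/T are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    # Fixed lexicographic order so every sequence yields comparable vectors.
    return [counts[w] / total if total else 0.0 for w in kmers]


def composition_vector(seq, ks=(1, 2, 3, 4)):
    """Concatenate k-mer frequencies for each k: 4 + 16 + 64 + 256 = 340 features."""
    vec = []
    for k in ks:
        vec.extend(kmer_frequencies(seq, k))
    return vec


if __name__ == "__main__":
    fragment = "ATGACTGCAATTTTAGAGCGTCGT"  # made-up fragment, for illustration only
    features = composition_vector(fragment)
    print(len(features))
```

Each labeled training sequence would be converted to such a vector and passed to an SVM implementation of choice; restricting `ks` to `(4,)` gives the tetranucleotide-only variant reported in Figure 1.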

