High precision multi-genome scale reannotation of enzyme function by EFICAz.

Arakaki AK, Tian W, Skolnick J - BMC Genomics (2006)

Bottom Line: The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.


Affiliation: Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA. adrian.arakaki@gatech.edu

ABSTRACT

Background: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.

Results: Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes whose biochemical functions have been recently characterized, and find that in 96% (84%) of the cases we correctly identified their three-field (four-field) EC numbers. For two of the 64 hypothetical proteins, PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3), and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected an annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as a candidate virulence factor based on its EFICAz-predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).

Conclusion: Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.


Figure 3: Similarity of 64 previously hypothetical proteins to EFICAz training enzymes. Number of previously hypothetical proteins predicted to be enzymes by EFICAz at different intervals of maximal sequence identity to enzymes included in the EFICAz version 5.0 training set. The true enzyme function of these 64 previously hypothetical proteins has been recently determined; therefore, we could assess the precision of our predictions. Dark green, light green and red bars represent four-field EC number predictions with four, three or fewer than three correct EC fields, respectively. Yellow and orange bars represent three-field EC number predictions with three or fewer than three correct EC fields, respectively. The median of the distribution (24.8%) is indicated by the broken line.

Mentions: In the previous section, we showed that a considerable fraction of an average proteome is annotated with at least three-field EC numbers only by EFICAz. An average of 36%, 25% and 12% of the three-field EC number annotations uniquely provided by EFICAz in the archaeal, bacterial and eukaryotic proteomes, respectively, correspond to proteins annotated as hypothetical in KEGG as of 2005. In this section, we assess the EFICAz predictions for a subset of hypothetical proteins for which experimentally derived enzyme function annotation has recently become available. More precisely, we compare the EFICAz-predicted and the experimentally derived EC numbers of 64 proteins annotated as hypothetical in KEGG whose enzyme functions we could confidently retrieve from the literature (see Methods for details). For this evaluation, we assume that the true EC number associated with an enzyme is the one derived from the cited experimental results. To exclude cases in which functional annotation could usually be transferred successfully by simple sequence-similarity-based methods, we only consider hypothetical proteins whose maximal sequence identity to any of the enzymes we used to train EFICAz is less than 60%. We have previously shown that below this threshold of sequence identity the conservation of enzyme function is on average poor [9]. From the histogram shown in Figure 3, we can observe that the median value of the maximal sequence identity to training enzymes is only about 25% (24.8%).
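The comparison described above (count how many leading EC fields of a prediction agree with the experimentally derived EC number, after discarding proteins at or above the 60% identity cutoff) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the record layout and the toy records are assumptions made purely for illustration, and the toy values are not data from the paper.

```python
from statistics import median


def matching_ec_fields(predicted: str, experimental: str) -> int:
    """Count the leading EC fields on which two EC numbers agree.

    A '-' placeholder (e.g. '3.6.1.-' for a three-field prediction)
    is treated as unspecified and stops the comparison.
    """
    count = 0
    for p, e in zip(predicted.split("."), experimental.split(".")):
        if p == "-" or e == "-" or p != e:
            break
        count += 1
    return count


def evaluate(records, identity_cutoff=60.0):
    """Summarize predictions for proteins below the identity cutoff.

    records: iterable of (predicted_ec, experimental_ec, max_identity),
    where max_identity is the maximal percent sequence identity to any
    training enzyme (hypothetical record layout).
    """
    kept = [r for r in records if r[2] < identity_cutoff]
    if not kept:
        raise ValueError("no records below the identity cutoff")
    # A prediction counts as three-field correct if at least the first
    # three EC fields match the experimental EC number.
    correct = sum(1 for p, e, _ in kept if matching_ec_fields(p, e) >= 3)
    return {
        "n_kept": len(kept),
        "three_field_precision": correct / len(kept),
        "median_identity": median(r[2] for r in kept),
    }


# Toy records, invented for illustration only (not values from the paper).
toy_records = [
    ("4.2.2.3", "4.2.2.3", 22.0),   # all four fields agree
    ("3.6.1.-", "3.6.1.13", 31.5),  # three-field prediction, three agree
    ("3.1.4.12", "3.1.3.1", 18.0),  # only two fields agree
    ("1.1.1.1", "1.1.1.1", 75.0),   # excluded: identity >= 60%
]

summary = evaluate(toy_records)
```

The sketch uses a single denominator (all retained proteins) for the three-field figure; how the paper counts four-field precision is described in its Methods and is not reproduced here.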

