High precision multi-genome scale reannotation of enzyme function by EFICAz.

Arakaki AK, Tian W, Skolnick J - BMC Genomics (2006)

Bottom Line: The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.

Affiliation: Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA. adrian.arakaki@gatech.edu

ABSTRACT

Background: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.

Results: Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes whose biochemical functions have been recently characterized, and find that we correctly identified their three-field EC numbers in 96% of the cases and their four-field EC numbers in 84% of the cases. For two of the 64 hypothetical proteins, PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3), and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected an annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as a candidate virulence factor based on its EFICAz-predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).

Conclusion: Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.

Figure 4: Benchmark test of updated versions of EFICAz. Precision (A-C), recall (D-F) and number of enzyme types described by four-field EC numbers (G-I) for different versions of EFICAz, at different levels of maximal testing to training sequence identity, averaged per enzyme type. Curves in red correspond to enzyme types for which at least 10 training sequences were available; curves in blue correspond to all enzyme types. The training of versions 2.0, 3.0 and 4.0 of EFICAz is based on the Releases 2.0, 3.0 and 4.0 of UniProt, respectively. The new Swiss-Prot sequences added to UniProt 5.0 since the release of UniProt 2.0, 3.0 and 4.0 constitute the test sequences for versions 2.0, 3.0 and 4.0 of EFICAz. See Methods for a full description of the benchmark procedure.
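The caption reports precision and recall "averaged per enzyme type". As a rough illustration of what such an average can look like, the sketch below computes precision and recall separately for each four-field EC number and then takes the unweighted mean across EC numbers. This is only one plausible reading of the caption, not the benchmark procedure from the paper's Methods; the function name, the input layout (dictionaries mapping sequence identifiers to sets of EC numbers) and the toy data are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean


def per_type_precision_recall(true_ec, predicted_ec):
    """Unweighted mean of per-EC-number precision and recall.

    true_ec, predicted_ec: dict mapping a sequence id to a set of EC numbers
    (hypothetical input layout, chosen for this illustration).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for seq_id in set(true_ec) | set(predicted_ec):
        truth = true_ec.get(seq_id, set())
        pred = predicted_ec.get(seq_id, set())
        for ec in pred & truth:
            tp[ec] += 1  # predicted and correct
        for ec in pred - truth:
            fp[ec] += 1  # predicted but wrong
        for ec in truth - pred:
            fn[ec] += 1  # true function that was missed

    # Precision is defined only for EC numbers predicted at least once,
    # recall only for EC numbers with at least one true member.
    precisions = [tp[ec] / (tp[ec] + fp[ec]) for ec in set(tp) | set(fp)]
    recalls = [tp[ec] / (tp[ec] + fn[ec]) for ec in set(tp) | set(fn)]
    mean_precision = mean(precisions) if precisions else float("nan")
    mean_recall = mean(recalls) if recalls else float("nan")
    return mean_precision, mean_recall


# Toy data, invented for this example only.
toy_truth = {"s1": {"1.1.1.1"}, "s2": {"1.1.1.1"}, "s3": {"2.7.1.1"}}
toy_pred = {"s1": {"1.1.1.1"}, "s2": {"2.7.1.1"}, "s3": {"2.7.1.1"}}
print(per_type_precision_recall(toy_truth, toy_pred))
```

Averaging per enzyme type gives every EC number equal weight, so a well-populated enzyme family does not dominate the figure the way it would in a per-sequence average.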

Mentions: Precision was the highest priority of our analysis; accordingly, our results suggest that by using EFICAz [26], we have generated annotations of good quality. First, the comprehensive series of benchmarks of EFICAz show that we can expect a mean precision of 94% regardless of the sequence similarity between testing and training enzymes (Figure 4A–C). Second, by comparing our predictions with KEGG annotations available a year later (which can take advantage of updated databases and new experimental results available in the literature), we find that most of the newly added KEGG enzyme function annotations agreed with our earlier EFICAz predictions (Figure 2). Third, by way of illustration, we identified a set of 64 previously hypothetical proteins whose biochemical functions have been recently characterized and found that in 96% of the cases, we correctly identified their three-field EC numbers, and in 84% of the cases, we could provide their fully detailed enzymatic activities (Tables 1, 2, 3). Achieving this level of precision is not trivial, considering that: (i) hypothetical proteins are the most difficult targets for automated function prediction [77], and (ii) the maximal sequence identity between the 64 hypothetical proteins and the EFICAz training enzymes has a median value of 25% (Figure 3). We were surprised to find a few cases among this set of 64 hypothetical proteins, where the annotation lag in databases was more than two years. It is difficult to estimate the full dimension of this problem; nevertheless, a systematic rescue of those annotations lost in the literature is very much needed, given the low number of experimentally verified functional assignments in the current databases [78].
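The distinction between three-field and four-field agreement in the paragraph above can be made concrete with a short sketch. It compares a predicted EC number with a reference EC number on their first n dot-separated fields and reports the fraction of pairs that agree. The helper names and the toy pairs are hypothetical and are not taken from the paper's tables; the treatment of "-" as an unspecified field follows the usual convention for partial EC numbers.

```python
def ec_fields_match(predicted, reference, n_fields):
    """True if the first n_fields of two EC numbers are identical and fully specified."""
    p = predicted.split(".")[:n_fields]
    r = reference.split(".")[:n_fields]
    if len(p) < n_fields or len(r) < n_fields:
        return False  # one of the EC numbers is too short to compare
    if "-" in p or "-" in r:
        return False  # an unspecified field cannot count as agreement
    return p == r


def agreement_rate(pairs, n_fields):
    """Fraction of (predicted, reference) pairs that agree on the first n_fields."""
    if not pairs:
        raise ValueError("no pairs to compare")
    hits = sum(ec_fields_match(p, r, n_fields) for p, r in pairs)
    return hits / len(pairs)


# Toy (predicted, reference) pairs, invented for this example only.
toy_pairs = [
    ("4.2.2.3", "4.2.2.3"),    # agrees at three and at four fields
    ("3.6.1.13", "3.6.1.10"),  # agrees at three fields only
    ("3.1.4.12", "3.1.3.12"),  # disagrees at the third field
]
print(agreement_rate(toy_pairs, 3), agreement_rate(toy_pairs, 4))
```

Because a four-field match implies a three-field match, the four-field rate can never exceed the three-field rate, which is consistent with the 84% and 96% figures quoted above.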

