High precision multi-genome scale reannotation of enzyme function by EFICAz.

Arakaki AK, Tian W, Skolnick J - BMC Genomics (2006)

Bottom Line: The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.

Affiliation: Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA. adrian.arakaki@gatech.edu

ABSTRACT

Background: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.

Results: Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes, whose biochemical functions have been recently characterized, and find that in 96% (84%) of the cases we correctly identified their three-field (four-field) EC numbers. For two of the 64 hypothetical proteins, PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3), and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected an annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as a candidate virulence factor based on its EFICAz-predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).

Conclusion: Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.

Figure 2: Comparison of EFICAz predictions with KEGG annotations. EFICAz predictions are compared with KEGG annotations from the Genes database of March 5, 2005, Release 33.0+/03–05 (A-B) and of March 7, 2006, Release 37.0+/03–07 (C-D). We analyze two levels of enzyme function description: four-field EC numbers (A, C) and three-field EC numbers (B, D). For all, archaeal, bacterial and eukaryotic genomes, we plot the average percentage of enzymatic proteins per genome whose EFICAz-inferred and KEGG-provided annotations at the specified level of detail agree (green columns) or disagree (red columns), and whose enzyme function annotation at the specified level of detail is provided only by EFICAz (blue columns) or only by KEGG (yellow columns). The numeric values inserted in each stacked column are the corresponding average percentage of enzymatic proteins per genome ± the standard deviation.

Mentions: To evaluate the level of agreement of EFICAz predictions with other sources of annotation, we compared our enzyme function assignments to those available in the Genes database of KEGG. In general, the quality and completeness of the functional annotation of genomes tend to improve continuously, owing to the steady flow of new experimental results and the correction of systematic errors in annotation transfer [13]. To account for the dynamic nature of the functional annotation process, we compare our predictions with annotations from two different releases of the Genes database: (i) 33.0+/03–05 of March 5, 2005, which is contemporary with the sources we employed for training the version of EFICAz used for our multi-genome scale enzyme annotation effort (Fig. 2A, B), and (ii) 37.0+/03–07, released a year later (Fig. 2C, D). We compare the enzyme function annotations at the level of four-field EC numbers (Fig. 2A, C) and three-field EC numbers (Fig. 2B, D); in the latter case, we compare only the first three fields of the annotated EC numbers, whether the fourth field is known or unknown. Both our EFICAz predictions and the set of KEGG annotations as of 2006 are available on our website [32], where the assignments made by EFICAz and/or KEGG can be browsed. The annotations can also be selected and retrieved according to species name, level of detail of the enzyme function prediction (four-field or three-field EC numbers), consistency or inconsistency between EFICAz and KEGG assignments, presence of the keywords "hypothetical" or "unknown" in KEGG assignments as of 2005, EC number and gene name.
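
The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the gene identifiers and the toy EC assignments are hypothetical, each protein is assumed to carry at most one EC number per source, and percentages are assumed to be taken over proteins annotated by at least one source at the chosen level of detail.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def ec_prefix(ec: Optional[str], fields: int) -> Optional[str]:
    """Return the first `fields` fields of an EC number, or None if any
    of those fields is missing or undetermined (e.g. '-')."""
    if not ec:
        return None
    head = ec.strip().split(".")[:fields]
    if len(head) < fields or any(not part.isdigit() for part in head):
        return None
    return ".".join(head)


def compare(eficaz: Optional[str], kegg: Optional[str], fields: int) -> str:
    """Classify one protein at the given level of detail (3 or 4 fields)."""
    a = ec_prefix(eficaz, fields)
    b = ec_prefix(kegg, fields)
    if a is None and b is None:
        return "none"
    if b is None:
        return "eficaz_only"
    if a is None:
        return "kegg_only"
    return "agree" if a == b else "disagree"


def genome_percentages(
    annotations: Dict[str, Tuple[Optional[str], Optional[str]]], fields: int
) -> Dict[str, float]:
    """Percentage of annotated proteins in each category for one genome.
    Assumption: the denominator is the number of proteins annotated by
    EFICAz and/or KEGG at this level of detail."""
    counts = Counter(compare(e, k, fields) for e, k in annotations.values())
    counts.pop("none", None)  # proteins with no annotation from either source
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical toy genome: gene -> (EFICAz EC, KEGG EC)
genome = {
    "geneA": ("3.6.1.13", "3.6.1.13"),
    "geneB": ("4.2.2.3", None),
    "geneC": ("3.1.4.-", "3.1.4.12"),
    "geneD": ("2.7.1.1", "2.7.1.2"),
    "geneE": (None, None),
}

four_field = genome_percentages(genome, 4)
three_field = genome_percentages(genome, 3)
```

In this toy genome, geneC and geneD change category between the two levels of detail: a partial or differing fourth field counts against agreement at four fields but is ignored at three fields, which mirrors the statement that only the first three fields are compared whether or not the fourth is known.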

