DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T - BMC Bioinformatics (2015)

Bottom Line: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.


Affiliation: Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan. wtm12@mails.tsinghua.edu.cn.

ABSTRACT

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.
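The abstract describes DomSign as a top-down, domain signature-based engine with a specificity threshold, but gives no algorithmic detail. The following is a minimal sketch of what a top-down EC assignment could look like, not the authors' implementation. It assumes that a query protein's domain signature has already been matched to a set of reference proteins with known EC numbers; the function name `top_down_ec`, the input format, and the rule of comparing the dominant digit's share against all matched references are illustrative assumptions.

```python
from collections import Counter


def top_down_ec(ec_numbers, threshold=0.8):
    """Assign an EC number digit by digit, from the first field down.

    ec_numbers: EC strings (e.g. "1.1.1.1" or "3.4.-.-") of reference
                proteins assumed to share the query's domain signature.
    threshold:  minimum share (hypothetical specificity cutoff) that the
                dominant digit must reach among all references.
    Returns a four-field EC string; unresolved fields are "-".
    """
    total = len(ec_numbers)
    candidates = [ec.split(".") for ec in ec_numbers]
    prefix = []
    for level in range(4):
        # Count the digits seen at this level among the surviving references.
        counts = Counter(p[level] for p in candidates if p[level] != "-")
        if not counts:
            break
        digit, n = counts.most_common(1)[0]
        # Stop descending once the dominant digit is not specific enough.
        if n / total < threshold:
            break
        prefix.append(digit)
        candidates = [p for p in candidates if p[level] == digit]
    return ".".join(prefix + ["-"] * (4 - len(prefix)))


if __name__ == "__main__":
    refs = ["1.1.1.1"] * 8 + ["1.1.1.2"] * 2
    print(top_down_ec(refs, threshold=0.8))
    print(top_down_ec(refs, threshold=0.9))
```

In this sketch, raising the threshold trades annotation depth for confidence: the same references yield a four-digit call at 0.8 but only a three-digit call at 0.9.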


Fig5: Expansion of enzyme space in UniProt-TrEMBL and KEGG by DomSign (specificity threshold = 80%). (A) Expansion of enzyme space in UniProt-TrEMBL. The circles illustrate the distribution of three kinds of proteins in the TrEMBL database. Blue: enzymes already tagged with EC numbers in TrEMBL; red: novel enzymes exclusively predicted by DomSign; light orange: other proteins without EC numbers. The column on the right represents the ratio of EC hierarchy levels among novel enzymes predicted by DomSign. Straight line: predicted enzymes annotated as EC = x.-.-.-; blank: annotated as EC = x.x.-.-; dot: annotated as EC = x.x.x.-; slash: annotated as EC = x.x.x.x. (B) Expansion of enzyme space in KEGG. Each blue dot represents the original enzyme ratio for one particular bacterial genome in KEGG. Each red dot represents the total enzyme ratio for that bacterial genome after DomSign annotation. In total, 2,584 bacterial genomes were tested.
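The caption groups predictions by EC hierarchy level (x.-.-.- through x.x.x.x). As a small illustration of that grouping, the sketch below counts how many leading fields of an EC string are resolved and tallies a list of predictions by level. The helper names `ec_level` and `level_distribution` and the example inputs are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def ec_level(ec):
    """Return how many leading EC fields are resolved (0 to 4).

    "1.-.-.-" -> 1, "1.2.-.-" -> 2, "1.2.3.-" -> 3, "1.2.3.4" -> 4.
    """
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC fields, got {ec!r}")
    level = 0
    for part in parts:
        if part == "-":
            break
        level += 1
    return level


def level_distribution(ec_list):
    """Map each hierarchy level to its share among the given EC strings."""
    counts = Counter(ec_level(ec) for ec in ec_list)
    total = sum(counts.values())
    return {lvl: counts[lvl] / total for lvl in sorted(counts)}


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    predictions = ["2.7.11.1", "3.4.-.-", "1.1.1.-", "2.7.11.1"]
    print(level_distribution(predictions))
```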

Mentions: Thus, we extended our data mining by predicting enzymes with EC numbers from all of the TrEMBL proteins. The annotation result is presented in Additional file 7. Approximately 3.9 million proteins lacking an EC number could be annotated with an EC number, and the majority of these belong to the three- or four-EC-digit group (Figure 5A). Even with a specificity threshold of 99%, the number of predicted novel enzymes was still around 3.6 million (Additional file 8), further indicating the reliability of this method. By this means, we successfully raised the EC-tagged enzyme ratio from the original 12% to ~30% in TrEMBL (Figure 5A) with high precision. To further illustrate the significance of this EC resource expansion, the increased EC-tagged enzyme ratios for every genome of the bacterial taxonomy in KEGG were calculated and are presented in Figure 5B (see Additional file 9 for detailed bacterial EC number annotations in KEGG). Remarkably, on average, we raised the EC-tagged enzyme ratio of each bacterial genome from the previous 26.0% to 33.2% for 2,584 bacterial genomes in KEGG, implying that the DomSign enzyme prediction method can provide deeper insight into the metabolism of many sequenced but insufficiently characterized organisms. Taken together, DomSign enzyme predictions in TrEMBL and KEGG increased the number of EC-labeled enzymes with precision and confirmed the existence of hypothetical gaps between the real enzyme space and the functional annotation.
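The figures quoted in this passage can be cross-checked with simple arithmetic. The sketch below uses only the rounded numbers stated above; the derived quantities (an implied TrEMBL size, the share of predictions retained at the 99% threshold, and the relative KEGG gain) are back-of-envelope estimates from those rounded values, not results reported by the authors.

```python
# Rounded values quoted in the text.
new_enzymes_80 = 3.9e6   # novel enzymes at specificity threshold 80%
new_enzymes_99 = 3.6e6   # novel enzymes at specificity threshold 99%
ratio_before = 0.12      # EC-tagged share of TrEMBL before DomSign
ratio_after = 0.30       # approximate share after DomSign
kegg_before = 26.0       # mean EC-tagged percentage per KEGG bacterial genome
kegg_after = 33.2

# If 3.9 million proteins account for the 18-point rise, the database
# size implied by these rounded figures is:
implied_total = new_enzymes_80 / (ratio_after - ratio_before)

# Share of the 80%-threshold predictions that survive the 99% threshold.
retained = new_enzymes_99 / new_enzymes_80

# Absolute (percentage points) and relative gain in KEGG.
kegg_gain_points = kegg_after - kegg_before
kegg_relative_gain = kegg_gain_points / kegg_before

print(f"implied TrEMBL size: {implied_total / 1e6:.1f} million proteins")
print(f"retained at 99% threshold: {retained:.1%}")
print(f"KEGG gain: {kegg_gain_points:.1f} points ({kegg_relative_gain:.1%} relative)")
```

Under these rounded inputs, the implied database size is roughly 22 million proteins, about 92% of the predictions remain at the stricter threshold, and the KEGG increase of 7.2 percentage points corresponds to a relative gain of about 28%.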

