DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T - BMC Bioinformatics (2015)

Bottom Line: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.


Affiliation: Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan. wtm12@mails.tsinghua.edu.cn.

ABSTRACT

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.

Figure 3: DomSign performance comparison with the BLAST and FS models by 1,000-fold cross-validation of “sprot protein”. Three levels of 1,000-fold cross-validation were conducted for each method. Homologs of a query above a given identity threshold (“identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%”, described in Methods) were removed from the reference dataset, so that each reference dataset kept only sequences at or below the given threshold. In this test, an 80% specificity threshold, a 10^-3 E-value cutoff and default parameters were applied to the DomSign, BLASTP and FS models, respectively. The relative standard errors were small (<1%) and are therefore not shown. (A) Results for the evaluation of the three methods. As shown on the right, four attributes are defined to evaluate the annotation results against the “true EC number” (see Methods for details). (B) The EC hierarchy level distribution in the annotation results of the three methods. Seven attributes are defined here to describe the annotation results. Among them, “No best hit” is specific to BLASTP. “More than one EC” is specific to the FS model because this dataset encompasses only enzymes with single EC numbers or non-enzymes, and this attribute is regarded as “OP” in Panel A. We merged the annotation results “Non-enzyme” and “EC = -.-.-.-” (shown in Figure 2) into one unified group, “Non-enzyme”, in the illustration, because the latter has no EC number assigned and accounts for only a small fraction of the annotation results (the “EC = -.-.-.-” subclass is only 1.4% in the “identity ≤ 100%” group for DomSign).
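The homolog-removal step described in the caption could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `filter_reference`, the precomputed `identity_to_query` mapping (percent identity of each reference sequence to the query) and the decision to treat sequences with no recorded identity as 0% are all assumptions made for the example.

```python
def filter_reference(reference_ids, identity_to_query, max_identity):
    """Keep reference sequences whose identity to the query is <= max_identity.

    reference_ids     -- iterable of sequence identifiers (hypothetical)
    identity_to_query -- dict mapping identifier -> percent identity to the
                         query; identifiers missing from the dict are treated
                         as 0% (an assumption, i.e. no detectable homology)
    max_identity      -- threshold in percent, e.g. 100, 60 or 30
    """
    return [
        rid
        for rid in reference_ids
        if identity_to_query.get(rid, 0.0) <= max_identity
    ]


# Toy data, invented for illustration only.
reference = ["seqA", "seqB", "seqC", "seqD"]
identity = {"seqA": 95.0, "seqB": 45.0, "seqC": 20.0}  # seqD has no hit

for threshold in (100, 60, 30):
    kept = filter_reference(reference, identity, threshold)
    print(f"identity <= {threshold}%: {kept}")
```

With the “identity ≤ 100%” setting nothing is removed, while the stricter settings simulate queries that have only distant homologs in the reference data.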

Owing to the top-down nature of our approach, we designed a new evaluation system that differentiates annotation results at different levels, giving better resolution than the widely used precision-recall curve [19]. Briefly, the predicted EC number (PE) is compared with the true EC number (TE), and the result is classified into the following groups (Figure 3A, right): E (Equality), PE is the same as TE (“PE: EC = 1.2.3.4” vs. “TE: EC = 1.2.3.4”); OP (Overprediction), there is at least one incorrectly assigned EC digit in PE compared with TE (“PE: EC = 1.2.1.1” vs. “TE: EC = 1.2.3.4”); IA (Insufficient Annotation), PE is correct but not complete compared with TE (“PE: EC = 1.2.-.-” vs. “TE: EC = 1.2.3.4”); and IM (Improvement), TE is the parent family of PE (“PE: EC = 1.2.3.4” vs. “TE: EC = 1.2.3.-”). When TE is “Non-enzyme”, the result is “Equality” if PE is also “Non-enzyme” and “Overprediction” otherwise. Additionally, if PE is “Non-enzyme” and TE is not, the result is IA. IA deserves specific mention: although it means the annotation is incomplete, the assigned digits are correct, so it does not increase the error rate. Thus, IA provides better annotation coverage while maintaining high precision. The evaluation metrics defined here differ from traditional ones [19]. However, compared with precision-recall curves, which treat the different EC hierarchy levels equally, this system covers all possible situations and gives an intuitive, higher-resolution view of performance at different annotation levels, which makes it especially suitable for evaluating annotation results with a hierarchical structure.
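The four-way comparison of PE and TE described above can be illustrated with a short sketch. It is an interpretation of the rules in the text rather than the authors' implementation; the names `classify` and `assigned_digits`, the `"Non-enzyme"` string constant and the plain `"1.2.-.-"` input format are assumptions chosen for the example.

```python
from collections import Counter

NON_ENZYME = "Non-enzyme"  # assumed label for proteins without an EC number


def assigned_digits(ec):
    """Return the leading assigned EC digits, e.g. '1.2.-.-' -> ['1', '2']."""
    digits = []
    for part in ec.split("."):
        if part == "-":
            break
        digits.append(part)
    return digits


def classify(pe, te):
    """Compare a predicted EC number (PE) with the true EC number (TE).

    Returns 'E' (Equality), 'OP' (Overprediction),
    'IA' (Insufficient Annotation) or 'IM' (Improvement).
    """
    if te == NON_ENZYME:
        return "E" if pe == NON_ENZYME else "OP"
    if pe == NON_ENZYME:
        return "IA"
    p, t = assigned_digits(pe), assigned_digits(te)
    shared = min(len(p), len(t))
    if p[:shared] != t[:shared]:
        return "OP"  # at least one assigned digit disagrees with TE
    if len(p) == len(t):
        return "E"
    # All shared digits agree: PE is either shallower (IA) or deeper (IM).
    return "IA" if len(p) < len(t) else "IM"


# The four examples given in the text, plus the non-enzyme cases.
pairs = [
    ("1.2.3.4", "1.2.3.4"),        # E
    ("1.2.1.1", "1.2.3.4"),        # OP
    ("1.2.-.-", "1.2.3.4"),        # IA
    ("1.2.3.4", "1.2.3.-"),        # IM
    (NON_ENZYME, NON_ENZYME),      # E
    ("1.2.3.4", NON_ENZYME),       # OP
    (NON_ENZYME, "1.2.3.4"),       # IA
]
summary = Counter(classify(pe, te) for pe, te in pairs)
print(summary)
```

Tallying the labels over a whole test set, as done here with `Counter`, gives the kind of per-class breakdown that Figure 3A reports for each method.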

