DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T - BMC Bioinformatics (2015)

Bottom Line: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.


Affiliation: Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan. wtm12@mails.tsinghua.edu.cn.

ABSTRACT

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.


Figure 4: Comparison between DomSign and EnzML using the Swiss-Prot&KEGG dataset and the Swiss-Prot&KEGG dataset extracted by UniRef50. The bar plot represents accuracy calculated by DomSign (white) and EnzML (gray). In contrast to panels (A) and (B), enzymes that are incorrectly annotated as non-enzymes by DomSign are excluded from the evaluation in panels (C) and (D). “Coverage” in panels (C) and (D) describes the percentage of proteins left after removal of real enzymes that were incorrectly predicted to be non-enzymes. ‘Example based precision’ and ‘Example based recall’ are used to evaluate the result as stated in Methods.

Mentions: The EnzML model is a multi-label classification method that uses Binary Relevance Nearest Neighbors (BR-kNN) to predict EC numbers [30]. Briefly, this model uses a more general protein signature set, InterPro [34], rather than Pfam as the input label. A multi-label support vector machine methodology was used, and the k parameter (the number of neighbors considered during the prediction) was optimized to ‘1’. The multi-label support vector machine methodology can be understood intuitively as a combination of multiple support vector machines, one for each of a series of binary labels (‘yes’ or ‘no’ for one particular EC hierarchy). Notably, Mulan [35], an open-source software infrastructure for evaluation and prediction, was used for this work. This model is presently the best benchmark and has been shown to be superior to other widely used tools such as ModEnzA [36] and EFICAz2 [37]. The “Swiss-Prot&KEGG” and the less redundant “UniRef50 Swiss-Prot&KEGG” [30] datasets were used for the 10-fold cross-validation (Figure 4A, B). Although the differences were not significant, we observed that EnzML performed better than DomSign in terms of example-based precision and recall. To clarify the source of these differences, we excluded from the evaluation the real enzymes that DomSign incorrectly predicted to be non-enzymes (Figure 4C, D). After this exclusion, DomSign’s performance became comparable to that of EnzML. Hence, we assert that the main reason for the loss of precision and recall in DomSign is that it is too strict when differentiating enzyme candidates from the protein pool. As a result, more enzymes are mistakenly categorized as non-enzymes by DomSign, leading to the loss of coverage. Even though this problem decreases the “example-based precision” defined here, it does not cause errors such as predicting the wrong EC number or mistakenly identifying a non-enzyme as an enzyme.
Considering that the EnzML model is difficult to implement, we posit that DomSign would be easier to apply when expanding the enzyme space from a large-scale dataset, as discussed in the next section.
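
The effect described above, where enzymes wrongly called non-enzymes lower the example-based scores without producing any wrong EC number, can be illustrated with a small sketch. It is not the authors' evaluation code. It assumes the standard multi-label definitions of example-based precision and recall (per-protein overlap between true and predicted label sets, averaged over proteins), treats every prefix of an EC number as one label, and scores an empty prediction as 0. The function names and the toy data are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (true EC, predicted EC); None = non-enzyme


def ec_labels(ec: Optional[str]) -> Set[str]:
    """Expand an EC number into its hierarchy prefixes, e.g. 1.1.1 -> {1, 1.1, 1.1.1}."""
    if not ec:
        return set()
    digits = [d for d in ec.split(".") if d != "-"]
    return {".".join(digits[:i]) for i in range(1, len(digits) + 1)}


def example_based_scores(pairs: List[Pair]) -> Tuple[float, float]:
    """Average per-protein precision and recall (assumed standard definition)."""
    if not pairs:
        return 0.0, 0.0
    precision_sum = recall_sum = 0.0
    for true_ec, pred_ec in pairs:
        truth, pred = ec_labels(true_ec), ec_labels(pred_ec)
        overlap = len(truth & pred)
        # Assumption: an empty label set contributes 0 rather than being skipped.
        precision_sum += overlap / len(pred) if pred else 0.0
        recall_sum += overlap / len(truth) if truth else 0.0
    return precision_sum / len(pairs), recall_sum / len(pairs)


def drop_missed_enzymes(pairs: List[Pair]) -> Tuple[List[Pair], float]:
    """Remove true enzymes predicted as non-enzymes; return the rest and the coverage."""
    kept = [(t, p) for t, p in pairs if not (ec_labels(t) and not ec_labels(p))]
    coverage = len(kept) / len(pairs) if pairs else 0.0
    return kept, coverage


# Toy data (hypothetical): two exact hits, one partial hit, one missed enzyme.
toy = [
    ("1.1.1.1", "1.1.1.1"),
    ("2.7.1.1", "2.7.1"),    # correct but less specific prediction
    ("3.4.21.4", None),      # real enzyme predicted as non-enzyme
    ("1.1.1.1", "1.1.1.1"),
]

precision_all, recall_all = example_based_scores(toy)
kept, coverage = drop_missed_enzymes(toy)
precision_kept, recall_kept = example_based_scores(kept)

print(f"all proteins:   precision={precision_all:.3f} recall={recall_all:.3f}")
print(f"after removal:  precision={precision_kept:.3f} recall={recall_kept:.3f} "
      f"coverage={coverage:.2f}")
```

In this toy set the single missed enzyme pulls both scores down, and removing it restores precision to 1.0 at the cost of coverage, which mirrors the contrast between panels (A, B) and (C, D) of Figure 4.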

