DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T - BMC Bioinformatics (2015)

Bottom Line: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.


Affiliation: Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan. wtm12@mails.tsinghua.edu.cn.

ABSTRACT

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.


Figure 1: Construction of the machine-learning model to predict EC numbers. (1) Test dataset: DSs and EC numbers for every enzyme were extracted from original datasets, such as Swiss-Prot. (2) These proteins were categorized into groups based on common DSs. Subsequently, the groups were divided into subgroups based on the corresponding EC numbers. Thus, the numbers in each cell represent the number of proteins in each subgroup, and the total member number for each group is summarized in the last row. The numbers of dominant subgroups within one group are colored red. (3) The abundance of each subgroup within its parent group (the same DS) was calculated and represented. The abundance of dominant subgroups for each group (the same DS) is colored red. (4) Prediction model: Every DS was associated with the relevant dominant EC number within its protein group (carrying this DS). The abundance of dominant EC subgroups was extracted and set as the “specificity” for this EC-DS pair.

Mentions: First, we converted the training dataset into a list in which every protein had one DS and one EC number (Figure 1(1)). Subsequently, the proteins were categorized based on their DSs. Thus, we constructed a series of protein groups in which all members contained the same DS. Here, we define the number of member proteins in one group as $N_{DS_i}$. Then, the member proteins in one group were further divided into subgroups based on their EC numbers, leading to a protein subgroup with the same EC ($N_{DS_i\text{-}EC_j}$, with $N_{DS_i} = \sum_j N_{DS_i\text{-}EC_j}$) (Figure 1(2)). Further, the abundance of every subgroup among one protein group was calculated ($A_{DS_i\text{-}EC_j} = N_{DS_i\text{-}EC_j} / N_{DS_i}$) (Figure 1(3)). In each group, there exists at least one dominant subgroup with the highest abundance. The EC number for this subgroup is then associated with the relevant DS, whereas the abundance of this subgroup is defined as the “specificity” for this DS-EC pair, which acts as the fundamental parameter in the machine-learning model (Figure 1(4)). We constructed four prediction models to assign four levels for one complete EC hierarchy. For each model, at the first step (Figure 1(1)) one fraction of the EC number was extracted; for instance, for the model focusing on the second EC hierarchy, EC = x.x.-.- is extracted. All further steps were the same during the construction of these four models. Thus, this machine-learning approach makes it possible to annotate the EC hierarchy from general to specific where the “specificity” of DS-EC pairs can be used to balance the tradeoff between recall and precision, depending on the particular purpose.
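The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy training pairs, the DS labels (`domA`, `domB`), the function names, and the `threshold` parameter are all hypothetical. The top-down stopping rule in `predict` is an assumption about how the "specificity" could be used as a cutoff; the excerpt does not spell out that procedure.

```python
from collections import Counter, defaultdict


def truncate_ec(ec, level):
    """Keep the first `level` EC digits and mask the rest (level 2 -> 'x.x.-.-')."""
    digits = ec.split(".")
    return ".".join(digits[:level] + ["-"] * (4 - level))


def build_model(proteins, level):
    """Map each DS to (dominant EC at `level`, specificity of that DS-EC pair).

    proteins: iterable of (domain_signature, full_ec_number) pairs.
    """
    groups = defaultdict(Counter)  # DS_i -> {EC_j: N_{DS_i-EC_j}}
    for ds, ec in proteins:
        groups[ds][truncate_ec(ec, level)] += 1

    model = {}
    for ds, counts in groups.items():
        n_ds = sum(counts.values())  # N_{DS_i}
        dominant_ec, n_dominant = counts.most_common(1)[0]
        model[ds] = (dominant_ec, n_dominant / n_ds)  # abundance of dominant subgroup
    return model


def predict(models, ds, threshold):
    """Assumed top-down rule: descend levels 1..4 while specificity >= threshold."""
    best = None
    for level in (1, 2, 3, 4):
        hit = models[level].get(ds)
        if hit is None or hit[1] < threshold:
            break
        best = hit[0]
    return best


# Hypothetical toy training set: one DS and one EC number per protein.
training = [
    ("domA", "1.1.1.1"),
    ("domA", "1.1.1.1"),
    ("domA", "1.1.1.2"),
    ("domA", "1.2.3.4"),
    ("domB", "3.4.21.5"),
]

# One model per EC hierarchy level, as in the four-model design described above.
models = {level: build_model(training, level) for level in (1, 2, 3, 4)}

print(models[2]["domA"])                       # ('1.1.-.-', 0.75)
print(predict(models, "domA", threshold=0.7))  # 1.1.1.-
print(predict(models, "domA", threshold=0.8))  # 1.-.-.-
```

In this sketch, raising the threshold gives a less specific but safer annotation, which is one way to read the recall/precision tradeoff mentioned above. With any threshold above 0.5, the ECs accepted at successive levels are necessarily nested, because a subgroup covering more than half of a group forces its prefix to dominate at the previous level.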

