DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T - BMC Bioinformatics (2015)

Bottom Line: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.


Affiliation: Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan. wtm12@mails.tsinghua.edu.cn.

ABSTRACT

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.

Figure 2: Schematic representation of the DomSign pipeline. The pipeline is divided into two parts: enzyme candidate selection and EC number annotation. In the first step, the dataset of specific enzyme domain signatures (DSs) is used, and all proteins whose DSs occur in this dataset are selected as potential enzyme candidates. In parallel, four annotation references, one for each of the four EC digit levels, are constructed as described in Figure 1. At every level, if the "specificity" of the corresponding DS-EC pair in the annotation reference is less than the user-defined threshold, the pipeline stops and the previously annotated EC digits form the output. Otherwise, the pipeline continues until the fourth EC digit has been annotated. An example of the DomSign procedure to annotate a protein is shown here. Because the specificity threshold is above the specificity of the DS-EC pair at the last level, only the first three EC digits are predicted, leading to the final result: EC = 1.1.1.-.

Mentions: First, the training dataset was used to construct four prediction models, one for each EC hierarchy level, and the DSs of query proteins were calculated by hmmsearch with a cut_tc cutoff and all other parameters set as default. Next, the specific enzyme DS dataset was used to select potential enzyme candidates from the query proteins. The four constructed prediction models were then applied one by one to annotate EC digits, assigning the EC number that corresponds to the query DS. In this process, a specificity threshold is applied to balance precision and recall. Specifically, when the "specificity" of the DS-EC pair is less than the specificity threshold, the procedure stops and only the EC digits annotated previously form the output (Figure 2). In this way, precision can be increased at the cost of recall by making the specificity threshold stricter, or vice versa. Additionally, although it is not statistically rigorous, the specificity of a particular DS-EC pair can be used as a confidence score to infer the reliability of each prediction by DomSign. For example, if DomSign assigns an enzyme the EC number 1.1.1.- and the specificity values for the DS-EC pairs at the first three hierarchical levels are 0.9, 0.88 and 0.85, these three values can be taken as the confidence scores for the first three EC digits, respectively. The script package for this tool is provided as Additional file 3.
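The top-down procedure described above can be illustrated with a minimal Python sketch. It is not the script package from Additional file 3. The data structures are assumptions made for illustration: a DS is modeled as a frozenset of domain identifiers, and each of the four annotation references as a dictionary from a DS to an EC prefix and its specificity. The function name `annotate_ec`, the domain names, the fourth-level specificity of 0.60 and the threshold of 0.80 are all hypothetical. Only the specificity values 0.9, 0.88 and 0.85 come from the text.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Assumption: a domain signature (DS) is the set of domain IDs found in a protein.
DomainSignature = FrozenSet[str]
# Assumption: one reference per EC level, mapping DS -> (EC prefix, specificity).
Reference = Dict[DomainSignature, Tuple[str, float]]


def annotate_ec(
    ds: DomainSignature,
    enzyme_signatures: Set[DomainSignature],
    references: List[Reference],
    threshold: float,
) -> Optional[Tuple[str, List[float]]]:
    """Return (EC number, per-level confidence scores), or None if the
    protein is not selected as an enzyme candidate."""
    # Part 1: enzyme candidate selection.
    if ds not in enzyme_signatures:
        return None

    # Part 2: annotate EC digits level by level, from the top down.
    prefix = ""
    confidences: List[float] = []
    for reference in references:
        entry = reference.get(ds)
        if entry is None:
            break  # no DS-EC pair at this level
        level_prefix, specificity = entry
        if specificity < threshold:
            break  # stop; keep only the digits annotated so far
        prefix = level_prefix
        confidences.append(specificity)

    # Pad the unannotated levels with "-" to obtain a four-field EC number.
    digits = prefix.split(".") if prefix else []
    digits += ["-"] * (4 - len(digits))
    return ".".join(digits), confidences


# Hypothetical query DS and references (domain names are placeholders).
query = frozenset({"DOM_A", "DOM_B"})
refs: List[Reference] = [
    {query: ("1", 0.90)},
    {query: ("1.1", 0.88)},
    {query: ("1.1.1", 0.85)},
    {query: ("1.1.1.1", 0.60)},  # hypothetical value below the threshold
]
result = annotate_ec(query, {query}, refs, threshold=0.80)
print(result)  # ('1.1.1.-', [0.9, 0.88, 0.85])
```

Raising `threshold` in this sketch makes the loop stop earlier, which yields shorter but more reliable EC prefixes. This mirrors the precision and recall trade-off described in the text.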

