DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.

Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T - BMC Bioinformatics (2015)

Bottom Line: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource.


Affiliation: Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan. wtm12@mails.tsinghua.edu.cn.

ABSTRACT

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.


Figure 6: Expansion of enzyme space in HMP non-redundant proteins by DomSign (specificity threshold = 80%). The circles illustrate the distribution of four kinds of proteins in the HMP non-redundant dataset. Red: enzymes with EC numbers annotated exclusively by HMP; blue: novel enzymes exclusively predicted by DomSign; green: enzymes identified by both HMP and DomSign; purple: all remaining proteins. The column on the right represents the ratio of EC hierarchy levels for predicted novel enzymes by DomSign, similar to the description in Figure 5.

Mentions: DomSign can recover more enzymes from this metagenomic dataset (Figure 6 and Additional file 11). Approximately one million new enzymes can be annotated with EC numbers exclusively by DomSign (around 7% of proteins in the HMP set) (Additional file 12), and 84% of them contain at least three EC digits. DomSign and HMP also seem to be highly complementary because half of their identified enzymes do not overlap. This is probably owing to the low Pfam-A coverage (45.7%) of HMP proteins and the appearance of many novel domain signatures (DSs) in metagenomic sequences. The complementary properties also indicate the possibility that DomSign can detect many different catalytic functions and thus may provide further insight into the metabolic capacity of the human microbiome. To test this hypothesis, we compared the unique four-digit EC numbers retrieved by both approaches. Here, the results for DomSign with a 99% specificity threshold were used to increase the reliability of EC number assignment. As an example, 81 novel EC numbers, which were exclusively detected by DomSign with a 99% specificity threshold, were discovered from the human gut microbiome (stool sample; Additional file 13), indicating one potential biologically significant discovery. These EC numbers may reflect important components that complement the known metabolism of the human microbiome.
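
The comparison described above (proteins annotated only by HMP, only by DomSign, or by both, plus the four-digit EC numbers found only by DomSign) can be illustrated with plain set operations. The following is a minimal sketch, not the authors' pipeline: it assumes each annotation source is a dictionary mapping a protein ID to a single EC string, with unresolved digits written as "-". The function names and the toy data are hypothetical.

```python
from collections import Counter


def ec_depth(ec):
    """Count resolved EC digits, e.g. '3.5.-.-' has depth 2."""
    return sum(1 for part in ec.split(".") if part != "-")


def partition(hmp, domsign):
    """Split protein IDs into HMP-only, DomSign-only and shared sets."""
    hmp_ids, dom_ids = set(hmp), set(domsign)
    return hmp_ids - dom_ids, dom_ids - hmp_ids, hmp_ids & dom_ids


def novel_four_digit_ecs(hmp, domsign):
    """Four-digit EC numbers assigned by DomSign but absent from HMP."""
    def full(annotations):
        return {ec for ec in annotations.values() if ec_depth(ec) == 4}
    return full(domsign) - full(hmp)


# Toy annotations (hypothetical, for illustration only).
hmp = {"p1": "1.1.1.1", "p2": "2.7.1.-"}
domsign = {"p2": "2.7.1.2", "p3": "3.5.-.-", "p4": "4.2.1.20"}

hmp_only, dom_only, shared = partition(hmp, domsign)

# Distribution of EC hierarchy levels among DomSign-only enzymes,
# analogous to the column on the right of Figure 6.
depths = Counter(ec_depth(domsign[p]) for p in dom_only)

novel = novel_four_digit_ecs(hmp, domsign)
```

With real data, the share of DomSign-only enzymes having a depth of at least three would correspond to the 84% figure quoted above, and the size of `novel` for one body site to counts such as the 81 EC numbers reported for the stool sample.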

