A domain-centric solution to functional genomics via dcGO Predictor.

Fang H, Gough J - BMC Bioinformatics (2013)

Bottom Line: We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. This generalized framework has also been applied to other domain classifications such as InterPro and Pfam, and to other ontologies such as the mammalian phenotype and disease ontologies.


Affiliation: Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK. hfang@cs.bris.ac.uk

ABSTRACT

Background: Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process, owing to the natural modularity of proteins, with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally, functional ontologies have been applied to individual proteins rather than to families of related domains and supra-domains. We expect, however, that functional signals are carried to some extent by protein domains and supra-domains, and can consequently be used in function prediction and functional genomics.

Results: Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) provided a valuable opportunity to validate our method and to have it assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework has also been applied to other domain classifications such as InterPro and Pfam, and to other ontologies such as the mammalian phenotype and disease ontologies. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO, including an enrichment analysis tool.

Conclusions: As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.



Figure 3: Flow diagram for generating meta-GO terms through information content (IC) analysis of domain-centric GO annotations. Briefly, all GO terms in the DAG are initially unmarked. First, identify the unmarked GO terms whose IC is closest to a predefined IC (e.g., 1). Next, mark the identified terms together with all of their ancestors and descendants, which excludes them from further search. Repeat these two steps until all GO terms in the DAG are marked. Finally, output as meta-GO only those identified GO terms whose IC falls within a given range (e.g., [0.75, 1.25]). The bottom panel displays the composition of meta-GO terms for domains at the SCOP family (FA) level, at the SCOP superfamily (SF) level, and for supra-domains.
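
The iterative marking procedure in the caption can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function name `select_meta_terms`, the input dictionaries, and the toy term names are all hypothetical, IC values are assumed to be precomputed, and ties in distance to the target IC are assumed to be resolved one term at a time in sorted order so that no selected term is an ancestor or descendant of another.

```python
from typing import Dict, List, Set, Tuple


def _reachable(start: str, edges: Dict[str, Set[str]]) -> Set[str]:
    """Return every node reachable from `start` by following `edges`."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def select_meta_terms(
    ic: Dict[str, float],
    parents: Dict[str, Set[str]],
    target_ic: float = 1.0,
    ic_range: Tuple[float, float] = (0.75, 1.25),
) -> List[str]:
    """Pick terms near `target_ic`, excluding relatives of earlier picks."""
    # Invert the child -> parents map to obtain parent -> children edges.
    children: Dict[str, Set[str]] = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)

    unmarked = set(ic)          # all terms start unmarked
    identified: List[str] = []
    while unmarked:
        # Find the unmarked terms whose IC is closest to the target.
        best = min(abs(ic[t] - target_ic) for t in unmarked)
        closest = sorted(t for t in unmarked if abs(ic[t] - target_ic) == best)
        for term in closest:
            if term not in unmarked:
                continue  # already marked as a relative of an earlier pick
            identified.append(term)
            # Mark the term plus all of its ancestors and descendants.
            relatives = _reachable(term, parents) | _reachable(term, children)
            unmarked -= relatives | {term}

    # Keep only identified terms whose IC lies inside the allowed range.
    lo, hi = ic_range
    return [t for t in identified if lo <= ic[t] <= hi]


# Toy DAG with made-up term names and IC values (illustrative only).
toy_ic = {"root": 0.0, "a": 0.9, "b": 1.2, "a1": 2.0, "b1": 1.0, "c": 3.0}
toy_parents = {"a": {"root"}, "b": {"root"}, "a1": {"a"}, "b1": {"b"}, "c": {"root"}}
print(select_meta_terms(toy_ic, toy_parents))
```

In this toy example, `b1` is picked first because its IC equals the target, which marks `b` and `root`. `a` is picked next and marks `a1`. `c` is picked last but is dropped by the final range filter.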

Mentions: We first examined the precision-recall (PR-RC) curves of our predictions using both domains and supra-domains for the eukaryotic sets (Figure 2A). Considering that only domain information is used, dcGO predictions were remarkably successful in recovering true functional annotations. Our predictions yielded the best results for Euk_set6, which is consistent with this set having the highest percentage of annotatable domains/supra-domains. We also found that using GO terms in Molecular Function (MF; top panel in Figure 2A) outperformed using those in Biological Process (BP; bottom panel in Figure 2A), indicating that the molecular function aspect is more relevant to describing domains/supra-domains. Interestingly, restricting the prediction to individual domains alone reduced performance only slightly when plotting PR-RC curves for the whole eukaryotic sets (Figure 2B). Further examination of the domain compositions of these eukaryotic targets revealed that only one-third of the targets were multi-domain proteins, far fewer than the average of 70% for eukaryotic proteins (as discussed in the Background section). We expect that the inclusion of supra-domains would lead to much better function prediction if a more representative set of multi-domain targets were included. When the method was applied to the prokaryotic sets (for which there is insufficient data for a proper evaluation, as stated in the CAFA experiment [20]), we surprisingly observed an overall performance similar to that for the eukaryotic sets (Figure 3C). This observation suggests, in part, that the dcGO approach is not very sensitive to the origin of the sequences, as long as the sequences to be predicted are not too atypical in their domain content.
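
To make the PR-RC curves discussed above concrete, the sketch below computes precision and recall over a range of score thresholds. It assumes a CAFA-style per-target averaging (precision averaged over targets with at least one prediction at the threshold, recall averaged over all targets); the exact evaluation protocol is not spelled out in this excerpt, and the function name `pr_rc_curve`, the data layout, and the toy targets and terms are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def pr_rc_curve(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) triples, averaged per target."""
    curve: List[Tuple[float, float, float]] = []
    for t in thresholds:
        precisions: List[float] = []
        recall_sum = 0.0
        for target, true_terms in truth.items():
            scored = predictions.get(target, {})
            # Terms predicted for this target at or above the threshold.
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision is only defined for targets with predictions.
                precisions.append(hits / len(predicted))
            if true_terms:
                recall_sum += hits / len(true_terms)
        if precisions:
            precision = sum(precisions) / len(precisions)
            recall = recall_sum / len(truth)
            curve.append((t, precision, recall))
    return curve


# Toy data with made-up targets, terms and scores (illustrative only).
toy_predictions = {
    "P1": {"term_A": 0.9, "term_B": 0.4},
    "P2": {"term_C": 0.6},
}
toy_truth = {"P1": {"term_A"}, "P2": {"term_C", "term_D"}}
toy_curve = pr_rc_curve(toy_predictions, toy_truth, [0.3, 0.5, 0.8])
for threshold, precision, recall in toy_curve:
    print(f"t={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold from 0.3 to 0.8 in this toy example increases precision while recall drops, which is the trade-off a PR-RC curve traces.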

