A domain-centric solution to functional genomics via dcGO Predictor.

Fang H, Gough J - BMC Bioinformatics (2013)

Bottom Line: We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. This generalized framework has also been applied to other domain classifications such as InterPro and Pfam, and to other ontologies such as mammalian phenotype and disease ontology.

Affiliation: Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK. hfang@cs.bris.ac.uk

ABSTRACT

Background: Computational and manual annotation of protein function is one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process, owing to the natural modularity of proteins, in which domains act as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally, functional ontologies have been applied to individual proteins rather than to families of related domains and supra-domains. We expect, however, that functional signals can to some extent be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics.

Results: Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) provided a valuable opportunity to validate our method and to have it assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework has also been applied to other domain classifications such as InterPro and Pfam, and to other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor, including an enrichment analysis tool, are available at http://supfam.org/SUPERFAMILY/dcGO.

Conclusions: As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next-generation sequencing era.

Figure 1: A domain-centric GO approach to automatically infer GO annotations for individual domains and supra-domains. The flowchart in the left panel illustrates the three major steps of the proposed approach: (Step 1) preparation of the correspondence matrix between domains/supra-domains and GO terms from protein-level annotations in UniProtKB-GOA and domain architectures in the SUPERFAMILY database, (Step 2) two types of statistical inference followed by FDR calculation, and (Step 3) application of the true-path rule to obtain the complete domain-centric GO annotations. The overall inference (I), relative inference (II) and the significance measure (III) are illustrated in the middle panel, both mathematically and graphically. Further illustration (IV) is given in the right panel by an example of inferring associations between a supra-domain (i.e., '82199,57667') and a GO term (i.e., 'stem cell maintenance'). Notably, 'stem cell maintenance' has a total of three direct parental GO terms (i.e., 'developmental process', 'negative regulation of cell differentiation', and 'stem cell development'), and Npa is the total number of UniProt proteins that can be annotated by any direct parental GO term.

Mentions: The structural domain information of a protein is closely related to its biological functions. To reveal the extent of functional signals carried by protein domains (and supra-domains in multi-domain proteins), we developed a domain-centric Gene Ontology (dcGO) approach (Figure 1; see also Methods for details), a generalized extension of our previous proposal [16]. Briefly, the implementation of this approach started from high-coverage domain architectures and high-quality GO annotations for proteins (obtained from SUPERFAMILY [21] and UniProtKB-GOA [22], respectively), resulting in the correspondence matrix between domains/supra-domains and GO terms. Based on this matrix, two types of statistical inference (i.e., overall and relative inference) were performed while respecting the directed acyclic graph (DAG) of GO; these dual inferences aimed to ensure that only the most relevant GO terms were retained. A false discovery rate (FDR) [23] was then calculated to measure the significance of inference, while a hypergeometric score (h-score) was used to indicate the strength of inference. Finally, we propagated the inferred GO terms to all their ancestors, generating the complete GO annotations for a domain/supra-domain. The middle panel in Figure 1 gives the analytic details, while the right panel illustrates how to infer possible associations between a supra-domain '82199,57667' ('82199' stands for 'SET domain', and '57667' for 'beta-beta-alpha zinc fingers') and a GO term 'GO:0019827' ('stem cell maintenance'). The full results for this example are accessible at [24]. From this link and Figure 1, we can see a significant association between the supra-domain and the GO term (FDR = 4.96E-8). Interestingly, of the two domains constituting this supra-domain, only 'SET domain' is associated with 'stem cell maintenance' (FDR = 7.15E-3; inherited annotation); 'beta-beta-alpha zinc fingers' is not. This example clearly shows the necessity of associating supra-domains of two or more domains with GO terms, as functional units can consist of more than one domain acting together or acting at an interface between domains.
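
The three steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact formulas are given in the paper's Methods and are not reproduced here. The sketch assumes that both inferences are one-sided hypergeometric tests that differ only in their background (all annotated proteins for the overall inference, proteins annotated by any direct parental term for the relative inference), and that the FDR is computed with the Benjamini-Hochberg procedure. All function names and the toy protein sets are hypothetical.

```python
from math import comb


def hypergeom_tail(k, N, M, n):
    """P(X >= k) when n proteins are drawn from N, of which M carry the term."""
    upper = min(M, n)
    total = comb(N, n)
    return sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1)) / total


def infer(domain_proteins, term_proteins, background):
    """Tail p-value for the domain/term overlap, restricted to a background set."""
    d = domain_proteins & background
    t = term_proteins & background
    return hypergeom_tail(len(d & t), len(background), len(t), len(d))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def with_ancestors(terms, parents):
    """True-path rule: return the terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


# Toy data (hypothetical): 20 annotated proteins identified by integers.
all_proteins = set(range(20))
domain = {0, 1, 2, 3}              # proteins containing the supra-domain
term = {0, 1, 2, 10}               # proteins annotated with the GO term
parent_annotated = set(range(12))  # proteins annotated by any direct parent (Npa = 12)

p_overall = infer(domain, term, all_proteins)       # background: all proteins
p_relative = infer(domain, term, parent_annotated)  # background: parental proteins

fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])

# Direct parents of 'stem cell maintenance' as listed in Figure 1;
# deeper ancestors are omitted from this toy DAG.
parents = {
    "stem cell maintenance": [
        "developmental process",
        "negative regulation of cell differentiation",
        "stem cell development",
    ],
}
annotations = with_ancestors({"stem cell maintenance"}, parents)
```

In this toy case the relative p-value is larger than the overall one, because restricting the background to parentally annotated proteins removes part of the enrichment that is already explained by the parent terms. Under the assumptions above, requiring both tests to pass is one way to read the statement that the dual inferences retain only the most relevant GO terms.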

