Protein function prediction using domain families.

Rentzsch R, Orengo CA - BMC Bioinformatics (2013)

Bottom Line: An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Affiliation: Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany. rentzschr@rki.de

ABSTRACT
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
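The abstract does not specify how GO terms were associated with FunFams probabilistically or how predictions were integrated for multi-domain proteins. The following is only a minimal sketch under two loudly stated assumptions: that a term's probability for a family can be approximated by the fraction of annotated member sequences carrying it, and that per-domain predictions can be merged by keeping the highest score per term. The function names and the placeholder term labels are hypothetical and are not taken from the article.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def term_probabilities(member_annotations: List[Set[str]]) -> Dict[str, float]:
    """Estimate per-term scores for one family.

    ASSUMPTION: score = fraction of annotated members that carry the term.
    The article only says the association is 'probabilistic'.
    """
    n_members = len(member_annotations)
    if n_members == 0:
        return {}
    counts = Counter(term for terms in member_annotations for term in terms)
    return {term: count / n_members for term, count in counts.items()}


def integrate_domains(per_domain_scores: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge the predictions of all domains found in one target protein.

    ASSUMPTION: keep the maximum score per term across domains.
    The article's actual integration step is not described in this excerpt.
    """
    combined: Dict[str, float] = {}
    for scores in per_domain_scores:
        for term, score in scores.items():
            combined[term] = max(score, combined.get(term, 0.0))
    return combined


# Hypothetical two-domain target protein; "term_A" etc. are placeholder labels.
family_of_domain_1 = [{"term_A"}, {"term_A", "term_B"}]
family_of_domain_2 = [{"term_B"}, {"term_B"}, {"term_C"}]

protein_prediction = integrate_domains(
    [term_probabilities(family_of_domain_1), term_probabilities(family_of_domain_2)]
)
```

Taking the maximum is only one of several plausible integration rules; it is used here because it never lowers the score a single domain family already supports.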

Figure 4: Two example domain sequence clusters and the associated protein function annotations. All domains are coloured and labelled according to their true functions; the available high-quality GO annotations of the parent proteins are shown on the right, respectively. The GO terms are coloured according to the protein functions they describe, and their hierarchical relationships in the GO DAG are shown at the bottom, respectively (dashed lines represent omitted intermediate terms). Both clusters (dashed boxes) represent functional domain families according to criteria outlined in the main text. The three reductase functions in (a) are closely related, as indicated by the three-functional cluster member sequences. In (b), the hydrolase function is perfectly conserved among all member domain sequences. Note that both the true domain functions and the types of the available annotations (core, extra and foreign-domain; see main text) are 'invisible' to the core set identification protocol. The small numbers next to specific functions indicate in which step of the iterative protocol described in the main text they are identified as 'core-associated' functions (see annotation editing section).

Mentions: Figure 4a shows a simple example annotation scenario in which a cluster contains four domain sequences with conserved reductase activity; in each parent protein chain shown on the left, this is the central domain (domain II). This domain is multi-functional in the sense that it can perform the same reaction on a range of highly similar (co-)substrates (Figure 4a; bottom). For simplicity, the other domains in these proteins are assumed to have a scaffold function only. The (partially incomplete) annotations of the parent proteins are shown on the right. In this example, the initial core set is (C1, P1), derived from the two sequences that are each associated with a single MF term (the smallest MF term sets occurring among the sequences).
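The seeding rule described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes the initial core set is the union of the MF terms carried by those sequences whose MF term sets are the smallest in the cluster. The sequence identifiers and the extra term labels (X1, X2) are hypothetical; only C1 and P1 appear in the text, and the toy annotations are not copied from Figure 4.

```python
from typing import Dict, Set, Tuple


def initial_core_set(mf_terms_by_sequence: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core_terms, seed_sequences) for one domain sequence cluster.

    ASSUMPTION: the core set is seeded from the sequences whose MF term
    sets are the smallest occurring ones; unannotated sequences are ignored.
    """
    annotated = {seq: terms for seq, terms in mf_terms_by_sequence.items() if terms}
    if not annotated:
        return set(), set()
    smallest_size = min(len(terms) for terms in annotated.values())
    seeds = {seq for seq, terms in annotated.items() if len(terms) == smallest_size}
    core_terms: Set[str] = set()
    for seq in seeds:
        core_terms |= annotated[seq]
    return core_terms, seeds


# Hypothetical cluster of four sequences; two of them carry a single MF term.
cluster = {
    "seq_1": {"C1"},
    "seq_2": {"P1"},
    "seq_3": {"C1", "P1", "X1"},
    "seq_4": {"C1", "X2"},
}
core_terms, seed_sequences = initial_core_set(cluster)
```

With this toy input, the two single-term sequences seed the core set, which matches the (C1, P1) outcome stated in the text. The later iterative steps that label further functions as 'core-associated' are not covered here.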

