Protein function prediction using domain families.

Rentzsch R, Orengo CA - BMC Bioinformatics (2013)

Bottom Line: An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Affiliation: Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany. rentzschr@rki.de

ABSTRACT
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Figure 1: Protein families versus protein domain families in protein function prediction. (a) Six evolutionarily related families of proteins with assigned domains. The proteins are coloured by function, where mixed colouring indicates multi-functionality. The domain dashing patterns indicate different superfamilies. A protein family resource would build one model per protein family (middle; coloured squares) and scan target proteins with these models, to assign them to families. (b) The domains from the proteins in (a) in their domain superfamilies, coloured by the function of the respective parent protein. Each superfamily is subdivided into functional families (dashed lines), based on the protocol described in the main text. Note that domains from functionally very similar proteins (red, orange, yellow) can go to the same family. The domain-based protein function prediction protocol first identifies domains in the target protein (bottom) and then scans each domain sequence with the functional family models available for its domain superfamily (middle). Each functional family is associated probabilistically with different whole-protein functions. Based on the family assignments of the individual domains, a combined function prediction for the whole protein is made. The best-scoring protein and domain family models are highlighted with a bold border in (a) and (b), respectively.

Mentions: A common sequence-based protein function prediction (ProFP) approach is the assignment of target proteins to an existing library of protein families, for example, those found in PANTHER [15] or TIGRFAMs [16], where each family is associated with an overall protein function (Figure 1a). Correspondingly, our method assigns each identified domain in a given target protein to an initially established library of domain families (Figure 1b), where each family is associated probabilistically with a set of protein functions. In this manner, we take into account the appearance of domains in different multi-domain and, therefore, functional contexts. The function assignments obtained for all domains are further integrated into whole-protein function predictions using a simple protocol.
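
The integration step is only described here as "a simple protocol", so the following Python sketch is an illustration under stated assumptions rather than the authors' method. It assumes each domain has already been assigned to one functional family, that each family carries a table of GO term probabilities, and that the whole-protein score for a GO term is the maximum probability over all domains. The family identifiers, the probability values, and the function name `integrate_domain_predictions` are hypothetical.

```python
from collections import defaultdict

# Hypothetical FunFam library: family id -> {GO term: probability}.
# The identifiers and probabilities are placeholders, not data from the paper.
FUNFAM_GO = {
    "sf1_ff1": {"GO:0003824": 0.9, "GO:0016301": 0.6},
    "sf2_ff3": {"GO:0016301": 0.8, "GO:0005524": 0.7},
}


def integrate_domain_predictions(domain_families, library, threshold=0.0):
    """Combine per-domain GO term probabilities into one whole-protein prediction.

    Assumption: the combined score for a GO term is the maximum probability
    contributed by any domain's family. The source does not specify its rule.
    """
    combined = defaultdict(float)
    for family in domain_families:
        # Families missing from the library contribute no terms.
        for go_term, prob in library.get(family, {}).items():
            combined[go_term] = max(combined[go_term], prob)
    # Keep only terms whose combined score reaches the threshold.
    return {term: p for term, p in combined.items() if p >= threshold}


# A two-domain target protein whose domains matched the two families above.
prediction = integrate_domain_predictions(["sf1_ff1", "sf2_ff3"], FUNFAM_GO)
```

Taking the maximum is one of several plausible choices; a noisy-OR or a weighted average would reward agreement between domains differently, and the sketch does not claim to match whatever rule the paper actually used.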

