Protein function prediction using domain families.

Rentzsch R, Orengo CA - BMC Bioinformatics (2013)

Bottom Line: An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.

Affiliation: Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany. rentzschr@rki.de

ABSTRACT
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
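The abstract does not say how FunFams were associated with GO terms or how predictions from several domains were integrated. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes a term's probability is the fraction of annotated family members carrying that term, and it assumes a noisy-OR rule for combining the domains of a multi-domain target. The function names `go_term_probabilities` and `integrate_domains` are hypothetical.

```python
from collections import Counter


def go_term_probabilities(member_annotations):
    """Estimate P(term | family) as the fraction of annotated members with the term.

    member_annotations: iterable of sets of GO term IDs, one set per family member.
    Members without any annotation are ignored (an assumption of this sketch).
    """
    annotated = [set(terms) for terms in member_annotations if terms]
    if not annotated:
        return {}
    counts = Counter(term for terms in annotated for term in terms)
    return {term: count / len(annotated) for term, count in counts.items()}


def integrate_domains(per_domain_probs):
    """Combine per-domain term probabilities with a noisy-OR (assumed rule).

    A term is predicted unless every domain that could contribute it 'misses',
    so the combined probability is 1 minus the product of the miss probabilities.
    """
    miss = {}
    for probs in per_domain_probs:
        for term, p in probs.items():
            miss[term] = miss.get(term, 1.0) * (1.0 - p)
    return {term: 1.0 - m for term, m in miss.items()}


# Hypothetical families matched by the two domains of one target protein.
family_a = [{"GO:0003824", "GO:0016301"}, {"GO:0003824"}, set()]
family_b = [{"GO:0016301"}, {"GO:0005524"}]

probs_a = go_term_probabilities(family_a)
probs_b = go_term_probabilities(family_b)
combined = integrate_domains([probs_a, probs_b])
```

A noisy-OR lets a term supported by two domains score higher than either domain alone would give it; taking the maximum per term would be an equally plausible alternative.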


Figure 5: The relative quality of the FunFam and different fixed-granularity GeMMA family partitionings obtained for 466 enzyme domain superfamilies in Gene3D. (a) The fraction of superfamilies for which a given method or clustering granularity setting yields or shares the top performance score. (b) Box and whisker plot of the average performance scores (diamonds) attained over all superfamilies with a given method or setting. Note that the maximum observed value (top whisker) can be considerably higher than the others; these putative outlier values are indicated at the top.

Mentions: The degree of sequence and (parent protein) function diversity varies widely among the domain superfamilies in Gene3D, reflecting the different degrees of versatility and evolutionary 'success' of individual domain types [19]. Using function annotation data should therefore yield much better family partitionings than clustering at a fixed granularity level alone. To test this expectation, FunFam and GeMMA families were compared on a test set of 466 superfamilies with high-quality enzyme annotations (EC4s). The results are shown in Figure 5.
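The statistic in panel (a) of Figure 5, the fraction of superfamilies for which a method yields or shares the top performance score, can be sketched as follows. This is an illustrative sketch only: the performance score itself is not defined in this excerpt, so scores are taken as given, and the function name, method labels, superfamily keys and tie tolerance are all hypothetical.

```python
def top_score_fractions(scores, tol=1e-9):
    """Fraction of superfamilies in which each method yields or shares the top score.

    scores: {superfamily_id: {method_name: score}}, higher scores are better.
    Methods within `tol` of the best score in a superfamily count as sharing it.
    """
    wins = {}
    for per_method in scores.values():
        best = max(per_method.values())
        for method, score in per_method.items():
            wins.setdefault(method, 0)
            if best - score <= tol:
                wins[method] += 1
    total = len(scores)
    return {method: count / total for method, count in wins.items()} if total else {}


# Hypothetical scores for two superfamilies and two partitioning methods.
example_scores = {
    "superfamily_1": {"FunFam": 0.9, "GeMMA_fixed": 0.7},
    "superfamily_2": {"FunFam": 0.8, "GeMMA_fixed": 0.8},
}
fractions = top_score_fractions(example_scores)
```

Because ties count for every method that shares the top score, the fractions can sum to more than 1.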

