Protein function prediction using domain families.

Rentzsch R, Orengo CA - BMC Bioinformatics (2013)

Bottom Line: An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Affiliation: Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany. rentzschr@rki.de

ABSTRACT
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
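The abstract does not say how the probabilistic GO term association or the multi-domain integration step was computed. The following is only a minimal sketch under stated assumptions: a term's score for a family is taken to be the fraction of annotated member sequences that carry it, and scores from several domains of one protein are combined by taking the maximum per term. The function names and the toy families are hypothetical and are not taken from the paper.

```python
from collections import Counter


def family_term_probabilities(member_annotations):
    """Score each GO term by the fraction of annotated members carrying it.

    member_annotations: list of sets of GO term IDs, one set per member.
    Assumption: simple frequency, not the authors' actual scoring scheme.
    """
    n = len(member_annotations)
    if n == 0:
        return {}
    counts = Counter(term for terms in member_annotations for term in terms)
    return {term: count / n for term, count in counts.items()}


def integrate_domains(per_domain_scores):
    """Combine per-domain term scores for one protein.

    Assumption: the highest score seen for a term in any domain wins.
    """
    combined = {}
    for scores in per_domain_scores:
        for term, p in scores.items():
            combined[term] = max(combined.get(term, 0.0), p)
    return combined


# Toy families (hypothetical). GO:0004672 is protein kinase activity,
# GO:0005524 is ATP binding.
kinase_family = family_term_probabilities([
    {"GO:0004672", "GO:0005524"},
    {"GO:0004672"},
    {"GO:0004672", "GO:0005524"},
    {"GO:0005524"},
])
binding_family = family_term_probabilities([
    {"GO:0005524"},
    {"GO:0005524"},
])

# A two-domain target protein assigned to both families.
protein_scores = integrate_domains([kinase_family, binding_family])
print(protein_scores)
```

Taking the maximum is only one possible integration rule; a noisy-OR or a weighted sum would be equally plausible readings of "integration step".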

Figure 2: Comparison of the GeMMA and FunFam family identification protocols. (a) The partial GeMMA clustering dendrogram of a sequence superfamily. Different colours correspond to the protein functions associated with the clusters; grey indicates a lack of annotations. (b) The corresponding part of the dendrogram when using the FunFam protocol. Note that unannotated starting clusters (grey) are here removed prior to clustering. The COMPASS [33] E-values at the bottom of both subfigures reflect the maximum sequence profile similarity observed between any two clusters at a given point, which decreases in the course of clustering [19]. The number of clusters (shown in this part of the dendrogram) that still exist when stopping the clustering at a given granularity level is stated at the top. Arrows indicate which clusters are eventually selected to represent functional families.

Mentions: Figure 2 summarises the differences between the original GeMMA protocol (a) and the CAFA FunFam protocol (b) using an example. Both parts show the partial clustering dendrogram of a given domain superfamily, processed with either method. The clusters are coloured by the functions (annotations) of the sequences they contain; grey indicates a lack of annotations.
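The two steps visible in panel (b), removing unannotated starting clusters and then merging the most similar clusters until a similarity cut-off is reached, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the average-linkage score over a hand-written pairwise table stands in for the COMPASS profile comparison, the cut-off value is arbitrary, and the supervised selection of final families (the arrows in the figure) is not modelled.

```python
from itertools import combinations


def cluster_similarity(a, b, pair_scores):
    """Average pairwise score between two clusters (toy stand-in for
    a profile-profile comparison)."""
    scores = [pair_scores[frozenset((x, y))]
              for x in a["members"] for y in b["members"]]
    return sum(scores) / len(scores)


def merge_until(clusters, pair_scores, min_similarity):
    """Drop unannotated clusters, then greedily merge the most similar pair
    until the best remaining similarity falls below min_similarity."""
    clusters = [c for c in clusters if c["terms"]]  # remove grey clusters
    history = []  # (similarity at merge, clusters left after merge)
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j], pair_scores))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break
        merged = {"members": clusters[i]["members"] | clusters[j]["members"],
                  "terms": clusters[i]["terms"] | clusters[j]["terms"]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((best, len(clusters)))
    return clusters, history


# Hypothetical starting clusters; s4 has no annotation and is removed.
start = [
    {"members": {"s1"}, "terms": {"function A"}},
    {"members": {"s2"}, "terms": {"function A"}},
    {"members": {"s3"}, "terms": {"function B"}},
    {"members": {"s4"}, "terms": set()},
]
pair_scores = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s1", "s3")): 0.2,
    frozenset(("s2", "s3")): 0.3,
}

final, history = merge_until(start, pair_scores, min_similarity=0.5)
print(final, history)
```

In this toy run s1 and s2 merge first; the remaining pair scores below the cut-off, so two clusters are left, mirroring how the cluster count at the top of the figure drops as the similarity threshold is relaxed.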

