Protein function prediction using domain families.

Rentzsch R, Orengo CA - BMC Bioinformatics (2013)

Bottom Line: An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Affiliation: Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany. rentzschr@rki.de

ABSTRACT
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
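The abstract does not specify how GO terms were associated with FunFams probabilistically, nor how predictions were integrated for multi-domain proteins. Purely as an illustration, the following sketch assumes that a term's probability is the fraction of annotated family members carrying it, and that per-domain predictions are combined by keeping the highest probability per term. Both rules, and the names `term_probabilities` and `integrate_domains`, are hypothetical and are not taken from the article.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def term_probabilities(member_annotations: Iterable[Set[str]]) -> Dict[str, float]:
    """Estimate a probability per GO term for one family.

    ASSUMPTION: probability = fraction of annotated members carrying the term.
    Members without any annotation are ignored.
    """
    annotated = [terms for terms in member_annotations if terms]
    if not annotated:
        return {}
    counts = Counter(term for terms in annotated for term in terms)
    return {term: n / len(annotated) for term, n in counts.items()}


def integrate_domains(per_domain: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine the term probabilities predicted for each domain of a protein.

    ASSUMPTION: keep the maximum probability seen for each term.
    """
    combined: Dict[str, float] = {}
    for probs in per_domain:
        for term, p in probs.items():
            combined[term] = max(combined.get(term, 0.0), p)
    return combined


# Placeholder term labels, not real GO identifiers.
family_a = term_probabilities([{"t1", "t2"}, {"t1"}, set()])
family_b = term_probabilities([{"t3"}, {"t1", "t3"}])
protein_prediction = integrate_domains([family_a, family_b])
```

Taking the maximum keeps a term that is well supported by one domain even if the other domains say nothing about it; other combination rules would be equally plausible under these assumptions.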

© Copyright Policy - open-access

Figure 3: The sequence and cluster term sets involved in assessing cluster functional coherence. The sequences in a given domain sequence cluster are associated with different GO term sets via their parent proteins. Each sequence term set can be split into MF, BP and CC term subsets. If any of the sequence MF term sets is not empty, a set of core MF terms for the cluster is compiled from all sequence MF term sets, and, via further, intermediate steps, a set of terms to be ignored in cluster assessment is compiled. If none of the sequence term sets contains MF terms, the filter term set is an empty set. In the next step, the initial cluster term set is prepared as the union of all representative sequence term sets. From this set, any terms that are also found in the filter term set are removed (filtered), yielding the final cluster term set. Like the sequence term sets, the cluster term set can be split into MF, BP and CC subsets. The term-type specific sets of the representative sequences are compared with those of the cluster as a whole, respectively, to assess the functional coherence of the sequences in the cluster. Key term sets in the described process are highlighted in bold.
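The set operations in this caption can be sketched in a few lines. The caption does not spell out the intermediate steps that turn the core MF terms into the filter term set, so the sketch takes the filter set as an input. The per-aspect coverage score used for "coherence" is an assumption chosen for illustration, not the measure used in the article, and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

ASPECTS = ("MF", "BP", "CC")

# A term set maps each GO aspect to the set of term labels for that aspect.
TermSet = Dict[str, Set[str]]


def final_cluster_terms(sequence_term_sets: List[TermSet],
                        filter_terms: Iterable[str]) -> TermSet:
    """Union of all sequence term sets minus the filter terms, split by aspect.

    As in the caption, the filter set is treated as empty when no sequence
    carries any MF term.
    """
    any_mf = any(ts.get("MF") for ts in sequence_term_sets)
    effective_filter = set(filter_terms) if any_mf else set()
    cluster: TermSet = {}
    for aspect in ASPECTS:
        union: Set[str] = set()
        for ts in sequence_term_sets:
            union |= ts.get(aspect, set())
        cluster[aspect] = union - effective_filter
    return cluster


def coherence(sequence_term_sets: List[TermSet],
              cluster_terms: TermSet) -> Dict[str, float]:
    """ASSUMPTION: mean fraction of the cluster's terms that each sequence carries.

    Aspects for which the cluster has no terms are skipped.
    """
    scores: Dict[str, float] = {}
    for aspect in ASPECTS:
        target = cluster_terms[aspect]
        if not target:
            continue
        fractions = [len(ts.get(aspect, set()) & target) / len(target)
                     for ts in sequence_term_sets]
        scores[aspect] = sum(fractions) / len(fractions)
    return scores


# Placeholder term labels, not real GO identifiers.
sequences = [
    {"MF": {"mf1", "mf2"}, "BP": {"bp1"}},
    {"MF": {"mf1"}, "BP": {"bp1", "bp2"}},
]
cluster = final_cluster_terms(sequences, filter_terms={"mf2"})
scores = coherence(sequences, cluster)
```

In this toy cluster, both sequences share the single remaining MF term, while only one of them carries both BP terms, so the MF score is higher than the BP score.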

Mentions: All sequence clusters in the generated dendrogram are processed and assessed for functional coherence using the following assumptions and algorithms. The assessment protocol was designed according to the requirement of using GO annotation data; note that it could be much simpler for EC data, for example, as will become clear in the following. Figure 3 serves as a guideline through the cluster analysis workflow, also listing the most important term sets used in the process.

