Annotation extension through protein family annotation coherence metrics.

Bastos HP, Clarke LA, Couto FM - Front Genet (2013)

Bottom Line: As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families.


Affiliation: LaSIGE, Department of Informatics, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal.

ABSTRACT
Protein functional annotation consists of associating proteins with textual descriptors that elucidate their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite the large number of existing protein annotation procedures, the ever-growing protein space is never completely annotated. One facet of annotation incompleteness derives from annotation uncertainty: when protein function cannot be predicted with enough specificity, the protein is often conservatively annotated with more generic terms. For protein families or functionally related (or even dissimilar) sets, this makes it harder to use annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example, we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that, through the use of visualization methods and semantic similarity based metrics, it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis, we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families.




Figure 5: Outline of proposed methodology for annotation extension.

Mentions: In light of the results discussed above, we propose a general methodology for extending GO annotations in protein families, as depicted in Figure 5. Consider a set of protein families created by curators within a given biological knowledge domain. A certain level of functional similarity is inherently expected within these families. After an initial collection of the terms annotating each family, a statistical enrichment analysis can be performed. This commonly used technique filters out annotation terms that are not characteristic of a family. At this point (Step 1), additional manually created rules might be beneficial in order to capture not only statistical support but also biological meaning related to the specific context domain of the protein families. Once the relevant annotation terms for a given family have been selected, the functional annotation coherence of the family can be assessed through the use of groupwise semantic similarity metrics (Step 2). A protein family showing greater annotation coherence may supply sub-sets of proteins (sequences) that can be used to create multiple sequence alignments. These can subsequently be used to create position-specific scoring matrices, hidden Markov models or other statistical models that can be used for classification. Also, any other available or obtainable protein feature from a sub-set of proteins sharing an annotation can theoretically be used with several machine learning techniques in order to create individual GO term classifiers. Visualization methods can be helpful in making this procedure semi-automatic. To that end, subsuming annotation graphs, like the ones in Figures 3 and 4, can be dynamically generated. These annotation graphs can also be made interactive in order to allow navigation through the individual nodes.
Hence, considering that each node represents an annotation term, the graph can be linked with the sub-set of proteins annotated by that term in a given family. This allows the selection of the proteins that will contribute features (sequences or otherwise) to construct the single GO term classifiers (Step 3). In turn, these classifiers can be used to extend functional annotation to incompletely annotated proteins within the given protein family (Step 4). By resubmitting the families to the annotation metrics, the coherence differential can be gauged after each iteration of annotation extension (Step 5). It should be noted that the overall family coherence metrics should be selected or customized to take into account the particular knowledge domain being assessed. Of particular note is that extensions are done per annotation term, and that each protein (and family) can have multiple functions and thus multiple terms associated with it.
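
As a rough illustration of Steps 1, 2 and 5, the following Python sketch computes a hypergeometric enrichment p-value for a term in a family, a family coherence score, and the coherence differential after one annotation extension. Everything in it is an assumption made for illustration: the toy ontology and its term names are hypothetical (not real GO identifiers), the protein labels are invented, and the coherence score is the average pairwise simUI measure (shared ancestor-closed terms over their union), which is only one possible groupwise metric and not necessarily the one used by the authors.

```python
from itertools import combinations
from math import comb

# Hypothetical toy ontology: term -> list of parent terms (is_a).
PARENTS = {
    "root": [],
    "catalytic": ["root"],
    "binding": ["root"],
    "lyase": ["catalytic"],
    "pectate_lyase": ["lyase"],
}

def with_ancestors(terms):
    """Close a set of terms under the is_a relation (term plus all ancestors)."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed

def enrichment_pvalue(k, n, K, N):
    """Step 1: hypergeometric upper tail P(X >= k).

    N proteins in the background, K of them carry the term,
    n proteins in the family, k of those carry the term.
    """
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

def sim_ui(terms_a, terms_b):
    """Groupwise similarity of two annotation sets: |A & B| / |A | B| on closed sets."""
    a, b = with_ancestors(terms_a), with_ancestors(terms_b)
    return len(a & b) / len(a | b)

def family_coherence(annotations):
    """Step 2: mean simUI over all protein pairs (needs at least two annotated proteins)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(sim_ui(a, b) for a, b in pairs) / len(pairs)

# Invented family: P3 is annotated only with the more generic term.
family = {
    "P1": {"pectate_lyase"},
    "P2": {"pectate_lyase"},
    "P3": {"lyase"},
}
before = family_coherence(family)

# Step 4 stand-in: pretend a classifier extended P3 to the specific term.
extended = dict(family, P3={"pectate_lyase"})
after = family_coherence(extended)

# Step 5: coherence differential for this iteration.
differential = after - before
```

In this toy case the differential is positive because the generically annotated protein gains the specific term shared by the rest of the family. In practice the extension in Step 4 would come from a trained per-term classifier rather than a hand-edited dictionary, and the metric would be chosen to suit the knowledge domain, as the text notes.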

