Function prediction from networks of local evolutionary similarity in protein structure.

Erdin S, Venner E, Lisewski AM, Lichtarge O - BMC Bioinformatics (2013)

Bottom Line: One proven computational strategy has been to group a few key functional amino acids into templates and search for these templates in other protein structures, so as to transfer function when a match is found. To this end, we previously developed Evolutionary Trace Annotation (ETA) and showed that diffusing known annotations over a network of template matches on a structural genomic scale improved predictions of function. We improve the accuracy and sensitivity of predictions by using multiple templates per protein structure when constructing networks of ETA matches and diffusing annotations.


Affiliation: Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA.

ABSTRACT

Background: Annotating protein function with both high accuracy and sensitivity remains a major challenge in structural genomics. One proven computational strategy has been to group a few key functional amino acids into templates and search for these templates in other protein structures, so as to transfer function when a match is found. To this end, we previously developed Evolutionary Trace Annotation (ETA) and showed that diffusing known annotations over a network of template matches on a structural genomic scale improved predictions of function. In order to further increase sensitivity, we now let each protein contribute multiple templates rather than just one, and also let the template size vary.

Results: Retrospective benchmarks in 605 Structural Genomics enzymes showed that multiple templates increased sensitivity by up to 14% when combined with single template predictions even as they maintained the accuracy over 91%. Diffusing function globally on networks of single and multiple template matches marginally increased the area under the ROC curve over 0.97, but in a subset of proteins that could not be annotated by ETA, the network approach recovered annotations for the most confident 20-23 of 91 cases with 100% accuracy.

Conclusions: We improve the accuracy and sensitivity of predictions by using multiple templates per protein structure when constructing networks of ETA matches and diffusing annotations.


Figure 1: (A) Graphical outline of the Evolutionary Trace Annotation (ETA) pipeline in single-template and multiple-template modes. The numbers 1,2,3,4,5,6 denote the components of the ETA pipeline: Evolutionary Trace, Template Picker, Paired-distance Matching, Support Vector Machine and Reciprocity, respectively. (B) Graphical outline of the network diffusion method.

Mentions: ETA follows an annotation strategy that consists of five steps, illustrated in both single-template and multiple-template modes in Figure 1A for the case of a mevalonate pyrophosphate decarboxylase from Mus musculus (PDB 3f0n; chain A; EC 4.1.1.33) (see Methods for details). First, ET ranks the evolutionary importance of the residues of the query protein, 3f0nA, whose function is sought. The top-ranked ET residues usually form surface clusters, and here ET identifies two significant clusters that may be functional sites: the first is {26K, 156S, 158S, 161R and 215M} and the second is {123G, 125A, 126S, 127S, 305D, 306A, 307G, 309N}. In single-template mode, ETA picks template residues from the larger of the two ({305D, 307G, 123G, 126S and 127S}, shown in red in Figure 1A). In multiple-template mode, ETA also picks an additional five-residue template made of the entire first cluster (brown template in Figure 1A). Next, a paired-distance matching algorithm searches for geometric similarity between the templates and structures from a subset of the Protein Data Bank (PDB) that is filtered for sequence identity and annotated with one or more known functions. These preliminary matches are filtered by a support vector machine (SVM), which selects a refined set by combining geometric similarity with ET rank similarity. To further increase specificity, ETA repeats the same steps in reverse: templates are now generated in the matched structures and searched for in the original query. Accepting only matches that fulfill such reciprocity reduces the likelihood that they arose by chance. In the example, ETA identifies mevalonate diphosphate decarboxylase from Staphylococcus aureus [29] (PDB 2hk3; chain B; EC 4.1.1.33) as a reciprocal match in single-template mode, while human mevalonate diphosphate decarboxylase [30] (PDB 3d4j; chain B; EC 4.1.1.33) is added to the reciprocal matches in multiple-template mode. In the fifth step, ETA selects the function seen in a plurality of the target structures that matched reciprocally to the query. In our example, EC 4.1.1.33 is selected with one and two votes in single-template and multiple-template modes, respectively.
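
The reciprocity filter and the plurality vote described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the dictionary layout of the match tables, and the choice to return no prediction on a tie are assumptions made only for illustration. The structure identifiers and EC number are the ones from the example in the text.

```python
from collections import Counter


def reciprocal_matches(query, forward, reverse):
    """Keep only targets hit by the query's templates whose own
    templates also hit the query (the reciprocity requirement)."""
    hits = forward.get(query, set())
    return sorted(t for t in hits if query in reverse.get(t, set()))


def plurality_vote(targets, annotations):
    """Return (function, votes) for the function seen most often among
    the targets, or None if there are no votes.
    Assumption: a tie for first place also yields None."""
    counts = Counter(annotations[t] for t in targets if t in annotations)
    ranked = counts.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0]


# Known functions of the matched structures, taken from the text.
annotations = {"2hk3B": "4.1.1.33", "3d4jB": "4.1.1.33"}

# Forward matches (query templates found in targets) for each mode.
single_forward = {"3f0nA": {"2hk3B"}}
multi_forward = {"3f0nA": {"2hk3B", "3d4jB"}}

# Reverse matches (target templates found in the query).
reverse = {"2hk3B": {"3f0nA"}, "3d4jB": {"3f0nA"}}

single = plurality_vote(
    reciprocal_matches("3f0nA", single_forward, reverse), annotations)
multi = plurality_vote(
    reciprocal_matches("3f0nA", multi_forward, reverse), annotations)

print(single)  # ('4.1.1.33', 1)
print(multi)   # ('4.1.1.33', 2)
```

In this sketch the second template changes only the vote count, which mirrors the one-vote and two-vote outcomes reported for the example.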

