The contrasting properties of conservation and correlated phylogeny in protein functional residue prediction.

Manning JR, Jefferson ER, Barton GJ - BMC Bioinformatics (2008)

Bottom Line: SMERFS predicts sites that are significantly more solvent accessible compared to Williamson. Additional computation employed by MINER in calculation of phylogenetic trees did not produce improved results over SMERFS. SMERFS performance was improved by use of windows over alignment columns, illustrating the necessity of considering the local environment of positions when assessing their functional significance.

Affiliation: School of Life Sciences Research, University of Dundee, UK. jrm@compbio.dundee.ac.uk

ABSTRACT

Background: Amino acids responsible for structure, core function or specificity may be inferred from multiple protein sequence alignments where a limited set of residue types are tolerated. The rise in available protein sequences continues to increase the power of techniques based on this principle.

Results: A new algorithm, SMERFS, for predicting protein functional sites from multiple sequence alignments was compared to 14 conservation measures and to the MINER algorithm. Validation was performed on an automatically generated dataset of 1457 families derived from the protein interactions database SNAPPI-DB, and a smaller manually curated set of 148 families. The best performing measure overall was Williamson property entropy, with ROC0.1 scores of 0.0087 and 0.0114 for domain and small molecule contact prediction, respectively. The Lancet method performed worse than random on protein-protein interaction site prediction (ROC0.1 score of 0.0008). The SMERFS algorithm gave similar accuracy to the phylogenetic tree-based MINER algorithm but was superior to Williamson in prediction of non-catalytic transient complex interfaces. SMERFS predicts sites that are significantly more solvent accessible compared to Williamson.
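
The ROC0.1 values quoted above can be read against a simple baseline. The sketch below assumes that ROC0.1 denotes the unnormalised area under the ROC curve for false-positive rates from 0 to 0.1 (this page does not define it), so a perfect ranking would score 0.1 and a random one about 0.005. The function name `roc_partial_area` and the toy data are hypothetical and are not taken from the paper.

```python
def roc_partial_area(scores, labels, max_fpr=0.1):
    """Unnormalised area under the ROC curve for FPR in [0, max_fpr].

    scores: per-position prediction scores (higher = more likely functional).
    labels: 1 for a true functional position, 0 otherwise.
    Assumption: this is one plausible reading of "ROC0.1", not the
    authors' published definition.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Process tied scores as one step so ties are not ordered arbitrarily.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr, tpr = fp / neg, tp / pos
        if fpr >= max_fpr:
            # Interpolate linearly to the cut-off and stop.
            if fpr > prev_fpr:
                frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
                tpr_cut = prev_tpr + frac * (tpr - prev_tpr)
                area += (max_fpr - prev_fpr) * (prev_tpr + tpr_cut) / 2
            return area
        # Trapezoid between the previous ROC point and this one.
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area


perfect = roc_partial_area([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = roc_partial_area([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Under this reading, the reported 0.0008 for protein-protein interaction sites falls below the roughly 0.005 expected from an uninformative ranking, which is consistent with the "worse than random" description.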

Conclusion: Williamson property entropy is the best performing of 14 conservation measures examined. The difference in performance of SMERFS relative to Williamson in manually defined complexes was dependent on complex type. The best choice of analysis method is therefore dependent on the system of interest. Additional computation employed by MINER in calculation of phylogenetic trees did not produce improved results over SMERFS. SMERFS performance was improved by use of windows over alignment columns, illustrating the necessity of considering the local environment of positions when assessing their functional significance.

Figure 1: Fragment of an Example Alignment. Illustration of the difference between highly conserved positions likely to be responsible for core structure and/or function and specificity-defining (SD) positions of a multiple sequence alignment. Columns 1 and 2 illustrate positions crucial in all family members. Column 3 shows a similar, though less stringent global pattern of conservation. Column 4 in contrast represents an SD position, where only a single amino acid is tolerated by each subfamily. Column 5 represents a non-conserved position for comparison.

Mentions: The most common approach for predicting functional residues from the amino acid sequence has been to exploit the evolutionary information present in an accurate multiple protein sequence alignment. Providing the sequences are sufficiently diverse, the location of positions in the alignment with invariant, or highly "conserved" amino acids may suggest structural or functional importance. A number of different scoring functions have been developed to quantify conservation of protein sequence alignment positions as reviewed in [6]. In recent work by Capra and Singh [7], important positions in a sequence alignment are located by differentiating amino acid distributions at positions under evolutionary pressure from those of positions that are not. All these techniques will identify columns 1–4 of the multiple alignment fragment shown in Figure 1 as conserved, but will not discriminate between columns 3 and 4. Since in a multiple sequence alignment the sequences are grouped by overall similarity, column 4 suggests that the position is important to all sequences, but conservation of Lys in one subfamily and Glu in the other indicates that the position may have a role in defining the structural or functional specificity within the protein family. A number of algorithms seek explicitly to identify such Specificity Defining (SD) positions and so discriminate between columns 3 and 4. For example, the AMAS algorithm [8] identifies SD positions that have conserved physicochemical properties within pre-defined sub-families of proteins (e.g. +ve charge), yet exhibit different properties between the sub-families (e.g. +ve charge compared to -ve). In the 'Evolutionary Trace' (ET) technique [9] alignments are first sub-grouped by the use of partitions on a phylogenetic tree, and predictions are interpreted by mapping onto protein structures. 
ET originally required the tree partition to be chosen manually, but subsequent developments have attempted to address this issue [10,11]. In contrast to AMAS and ET, the 'Sequence Space' algorithm [12], which represents the sequences of an alignment as vectors in a high-dimensional space, does not require pre-grouping of the sequences. In Sequence Space, principal components analysis is employed to derive both the principal sequence subgroupings and the SD positions characteristic of each group. More recently, Marttinen [13] applied a Bayesian statistical approach in an analogous manner, again deriving both the optimal subgrouping and functionally relevant positions of an alignment.
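
The distinction drawn above between globally conserved columns and SD columns can be illustrated with a small sketch. It is not AMAS, ET, Sequence Space or SMERFS: it computes plain Shannon entropy on residue identity and assumes the subfamilies are already known. The toy alignment, the subfamily assignment and the function names are invented for illustration and only loosely mimic the five columns of Figure 1.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (bits) of the residue distribution in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def classify_columns(alignment, subfamilies):
    """Label each column using overall and within-subfamily entropy.

    alignment:   list of equal-length, gap-free sequences (toy input).
    subfamilies: list of lists of row indices, one list per subfamily
                 (assumed to be given, as in pre-grouped approaches).
    """
    labels = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment]
        if shannon_entropy(residues) == 0:
            labels.append("invariant")
            continue
        per_group = [
            shannon_entropy([alignment[i][col] for i in group])
            for group in subfamilies
        ]
        if all(h == 0 for h in per_group):
            # Each subfamily tolerates one residue, but they differ overall.
            labels.append("specificity-defining")
        else:
            labels.append("other")
    return labels


# Hypothetical alignment: rows 0-2 form one subfamily, rows 3-5 another.
toy_alignment = ["GDLKA", "GDLKS", "GDIKT", "GDLEV", "GDIEP", "GDLEQ"]
toy_subfamilies = [[0, 1, 2], [3, 4, 5]]
column_labels = classify_columns(toy_alignment, toy_subfamilies)
```

In this toy case the fourth column (Lys in one subfamily, Glu in the other) is only recognised as distinctive once the entropy is evaluated within each subfamily; an overall score alone says nothing about how the residues are distributed across the groups.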

