Sequence based residue depth prediction using evolutionary information and predicted secondary structure.

Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L - BMC Bioinformatics (2008)

Bottom Line: The mean absolute and the mean relative errors are shown to be statistically significantly better when compared with a method recently proposed by Yuan and Wang [Proteins 2008; 70:509-516]. The results show that three-fold cross validation underestimates the variability of the prediction quality when compared with the results based on the ten-fold cross validation. The proposed method, RDPred, provides statistically significantly better predictions of residue depth when compared with the competing method.


Affiliation: College of Mathematical Science and LPMC, Nankai University, Tianjin, PR China. zerohua@gmail.com

ABSTRACT

Background: Residue depth measures how deeply a given residue is buried, in contrast to solvent accessibility, which differentiates between buried and solvent-exposed residues. Compared with solvent accessibility, depth allows studying deep-level structures and functional sites, and the formation of the protein folding nucleus. Accurate prediction of residue depth would provide valuable information for fold recognition, prediction of functional sites, and protein design.

Results: A new method, RDPred, for real-value depth prediction from protein sequence is proposed. RDPred combines information extracted from the sequence, PSI-BLAST scoring matrices, and secondary structure predicted with PSIPRED. Three-fold/ten-fold cross validation tests performed on three independent, low-identity datasets show that the distance based depth (computed using MSMS) predicted by RDPred is characterized by correlations of 0.67/0.67, 0.66/0.67, and 0.64/0.65 with the actual depth, by mean absolute errors of 0.56/0.56, 0.61/0.60, and 0.58/0.57, and by mean relative errors of 17.0%/16.9%, 18.2%/18.1%, and 17.7%/17.6%, respectively. The mean absolute and the mean relative errors are shown to be statistically significantly better when compared with a method recently proposed by Yuan and Wang [Proteins 2008; 70:509-516]. The results show that three-fold cross validation underestimates the variability of the prediction quality when compared with the results based on the ten-fold cross validation. We also show that hydrophilic and flexible residues are predicted more accurately than hydrophobic and rigid residues. Similarly, the charged residues, which include Lys, Glu, Asp, and Arg, are the most accurately predicted. Our analysis reveals that evolutionary information encoded using PSSM is characterized by stronger correlation with the depth for hydrophilic amino acids (AAs) and aliphatic AAs when compared with hydrophobic AAs and aromatic AAs. We further show that the secondary structure of coils and strands is useful in depth prediction, in contrast to helices, which have a relatively uniform distribution over the protein depth. Application of the predicted residue depth to prediction of buried/exposed residues shows consistent improvements in detection rates of both buried and exposed residues when compared with the competing method. Finally, we contrasted the prediction performance among distance based (MSMS and DPX) and volume based (SADIC) depth definitions. We found that the distance based indices are harder to predict due to the more complex nature of the corresponding depth profiles.
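The three quality measures quoted above (correlation, mean absolute error, mean relative error) can be illustrated with a minimal sketch. The abstract does not spell out the formulas, so the definitions below are assumptions: the correlation is taken to be the Pearson coefficient, and the relative error is taken to be the absolute error divided by the actual depth, averaged over residues. The function names and the toy depth values are hypothetical and are not taken from RDPred.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_absolute_error(pred, actual):
    """Average of |predicted - actual| over all residues."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def mean_relative_error(pred, actual):
    """Assumed definition: |predicted - actual| / actual, averaged over residues."""
    return sum(abs(p - a) / a for p, a in zip(pred, actual)) / len(actual)

# Toy depth values (hypothetical, not from the paper's datasets).
actual = [2.0, 4.0, 5.0, 8.0]
pred = [2.5, 3.5, 5.0, 7.0]

pcc = pearson(pred, actual)
mae = mean_absolute_error(pred, actual)
mre = mean_relative_error(pred, actual)
print(f"PCC={pcc:.3f}  MAE={mae:.3f}  MRE={100 * mre:.1f}%")
```

Under this assumed definition the relative error is only meaningful when every actual depth is strictly positive, which holds for distance based depth measured from the molecular surface.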

Conclusion: The proposed method, RDPred, provides statistically significantly better predictions of residue depth when compared with the competing method. The predicted depth can be used to provide improved prediction of both buried and exposed residues. The prediction of exposed residues has implications in characterization/prediction of interactions with ligands and other proteins, while the prediction of buried residues could be used in the context of folding predictions and simulations.



Figure 10: Distribution of PSSM features ranked based on PCC values of the corresponding, selected features. The shading of the individual cells (features) corresponds to their absolute average (over 3 folds in YW923 dataset) PCC values: black for PCC ≥ 0.3, dark gray for 0.2 ≤ PCC < 0.3, gray for 0.1 < PCC ≤ 0.2, light gray for the remaining selected features, and white for features that were not selected.

Mentions: The majority of RDPred's features (96 out of 125) correspond to PSSM values. At the same time, only 96 out of the total of 300 PSSM values were selected. Figure 10 visualizes the distribution of the selected PSSM features with respect to their positions in the window. Each cell in the figure corresponds to one PSSM based feature, and the shades of gray represent the absolute average PCC values (the values used during feature selection), i.e., a darker color corresponds to a stronger correlation, while white marks features that were not selected. We also computed the mean PCC values for each row and column in Figure 10. As expected, the distribution of PCC values is relatively symmetric with respect to the central position in the window, which implies that the residues on the N-terminal and C-terminal sides of the local window have a similar impact on the prediction performed for the central residue. The central residue (denoted by position 0) has the largest mean PCC, while other positions are characterized by lower mean PCC values. In general, the mean PCC values decline with increasing distance from the central position. The exception is positions -2 and +2 (two residues away from the central residue), for which the mean PCC is smaller than that for positions +/- 1 and +/- 3. While it is obvious that the positions immediately adjacent to the predicted residue should have a large impact on the prediction, we hypothesize that the low average PCC value for the +/- 2 positions is due to limited interactions between these residues and the central residue. For instance, in a helical structure, the residues at the +/- 3 or +/- 4 positions would form hydrogen bonds with the central AA, in contrast to the residues at the +/- 2 positions. The strongest PCC values for the +/- 3 or +/- 4 positions are observed for Glu (E), Gln (Q), and Lys (K), and these residues were shown to be strongly associated with the formation of helices [62], which further substantiates our hypothesis.
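The per-feature and per-position analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each window position is assumed to contribute a fixed number of PSSM columns (20 in the paper's setting, so 300 values would correspond to a 15-residue window; that window size is inferred, not stated in this excerpt), the selection threshold is hypothetical, and the sketch scores a single data set whereas the paper averages the PCC over three folds. The toy data use 3 positions and 2 columns per position.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def abs_pcc_per_feature(rows, depth):
    """Absolute PCC between every feature column and the residue depth."""
    n_features = len(rows[0])
    return [abs(pearson([r[j] for r in rows], depth)) for j in range(n_features)]

def mean_by_position(scores, n_aa):
    """Average the scores over the n_aa columns belonging to each window position."""
    return [sum(scores[i:i + n_aa]) / n_aa for i in range(0, len(scores), n_aa)]

def select_features(scores, threshold):
    """Indices of features whose score reaches the (hypothetical) threshold."""
    return [j for j, s in enumerate(scores) if s >= threshold]

# Toy data: 5 residues, window positions (-1, 0, +1), 2 PSSM columns per position.
depth = [1.0, 2.0, 3.0, 4.0, 5.0]
rows = [
    # pos -1    pos 0     pos +1
    [1, 2,      1, 5,     3, 2],
    [3, 2,      2, 4,     1, 1],
    [2, 2,      3, 3,     2, 3],
    [5, 2,      4, 2,     5, 1],
    [4, 2,      5, 1,     4, 2],
]

scores = abs_pcc_per_feature(rows, depth)
position_means = mean_by_position(scores, n_aa=2)
selected = select_features(scores, threshold=0.5)
print(position_means, selected)
```

In this toy example the central position gets the largest mean score, mirroring the pattern reported for Figure 10; the real analysis would use PSSM columns produced by PSI-BLAST rather than hand-written integers.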

