A grammar inference approach for predicting kinase specific phosphorylation sites.

Datta S, Mukhopadhyay S - PLoS ONE (2015)

Bottom Line: Extensive experiments on several datasets generated by us reveal that our inferred grammar successfully predicts phosphorylation sites in a kinase-specific manner. It performs significantly better than other existing phosphorylation site prediction methods. We have also compared our inferred DSFA with two other GI algorithms.


Affiliation: Department of Biophysics, Molecular Biology and Bioinformatics and Distributed Information Centre for Bioinformatics, University of Calcutta, Kolkata, West Bengal, India.

ABSTRACT
Kinase-mediated phosphorylation is a key post-translational modification mechanism that plays an important role in regulating various cellular processes and phenotypes. Many diseases, such as cancer, are related to signaling defects associated with protein phosphorylation. Characterizing protein kinases and their substrates enhances our ability to understand the mechanism of protein phosphorylation and extends our knowledge of signaling networks, thereby helping us to treat such diseases. Experimental methods for identifying phosphorylation sites are labour intensive and expensive. In addition, the manifold increase of protein sequences in the databanks over the years calls for fast and accurate computational methods for predicting phosphorylation sites in protein sequences. To date, a number of computational methods have been proposed for predicting phosphorylation sites, but there remains much scope for improvement. In this communication, we present a simple and novel method based on the Grammatical Inference (GI) approach to automate the prediction of kinase-specific phosphorylation sites. We have used a popular GI algorithm, Alergia, to infer Deterministic Stochastic Finite State Automata (DSFA), which equivalently represent the regular grammar corresponding to the phosphorylation sites. Extensive experiments on several datasets generated by us reveal that our inferred grammar successfully predicts phosphorylation sites in a kinase-specific manner. It performs significantly better than other existing phosphorylation site prediction methods. We have also compared our inferred DSFA with two other GI algorithms. The DSFA generated by our method performs better, which indicates that our method is robust and has potential for predicting phosphorylation sites in a kinase-specific manner.



pone.0122294.g005: ROC curve by varying threshold values for PHSDB dataset for the four kinases PKA, PKC, MAPK and CK2.

Mentions: Positive and negative sequences for each kinase family are downloaded from the PHOSPHO.ELM database, and the CDHIT program is run on these sequences to obtain a non-homologous dataset. Next, the positive data is divided into training and test datasets. We varied the size of the training set from 10% through 80% of the whole positive data. The accuracy of our method with various training data sizes is shown in Fig 4. We found from Fig 4 that a 60% training size is sufficient to achieve the best accuracy. Hence, we have divided the dataset in a 6:4 ratio, where 60% of the data is used as the training dataset for constructing the DSFA for further explanation and comparison purposes. A test dataset is constructed from the remaining 40% of the data and the negative sequences. In the test dataset, the ratio of positive to negative sequences is kept at 1:10 to avoid any biased prediction. We have named this dataset PHSDB. The numbers of positive and negative samples in the training and test datasets for the four kinases are summarized in Table 1. Furthermore, we have varied the threshold value from 0.001 to 0.01 for all the kinases. We have varied the window size from 5 to 21 and have found that our proposed method obtains the best result for window size 7. Hence we have fixed the window size of our method at 7.

In order to evaluate the performance of our method, we have generated a Receiver Operating Characteristic (ROC) curve for each of the kinase-specific phosphorylation site predictors. The ROC curve shows the trade-off between the True Positive Rate (TPR), i.e., sensitivity, and the False Positive Rate (FPR), i.e., 1 - specificity. The ROC curve is obtained by varying the threshold from 0.01 to 0.004. The ROC curve for the PHSDB dataset is shown in Fig 5. The ROC curves for all the kinases almost reach 100% sensitivity with at least 15% specificity. Moreover, from Fig 5 we can see that for all the kinases, a balanced and higher sensitivity and specificity is obtained at a threshold value near 0.008. As the threshold decreases, the corresponding specificity also decreases. Hence, for further experimentation we have used the threshold value 0.008 for our method. We have also calculated the precision, recall, accuracy and F-measure for the PHSDB dataset at the threshold 0.008, as shown in Table 2. The table shows that our method can predict phosphorylation sites for PKA, PKC, MAPK and CK2 with 96.11%, 97.38%, 96.54% and 96.12% accuracy respectively, and with quite high F-measure values of 0.7933, 0.8553, 0.8240 and 0.7999 respectively. This indicates that our method yields sufficiently high precision and recall values, which in other words means that our proposed approach can predict positive and negative instances quite accurately. To validate the result from the ROC curve that the threshold value 0.008 yields the best result, we have also plotted the change in accuracy with various threshold values. The plots, shown in S1–S4 Figs, demonstrate that our method obtains the best accuracy at the threshold value 0.008.
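The threshold-based evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each candidate window receives a probability-like score from the DSFA and is called a phosphorylation site when that score is at or above the threshold, which is consistent with specificity dropping as the threshold is lowered. The function names and the toy scores and labels are hypothetical and are not taken from PHSDB.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN), predicting positive when score >= threshold.

    The ">=" decision rule is an assumption made for this sketch.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall (sensitivity), specificity, accuracy and F-measure."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy,
            "f_measure": f_measure}


def roc_points(scores: Sequence[float], labels: Sequence[int],
               thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """One (FPR, TPR) point per threshold; FPR = 1 - specificity."""
    points = []
    for t in thresholds:
        m = metrics(*confusion_counts(scores, labels, t))
        points.append((1.0 - m["specificity"], m["recall"]))
    return points


# Hypothetical toy data: 1 = true site, 0 = non-site.
toy_scores = [0.012, 0.009, 0.0085, 0.003, 0.002, 0.001]
toy_labels = [1, 1, 0, 0, 0, 0]

for fpr, tpr in roc_points(toy_scores, toy_labels, [0.01, 0.008, 0.004]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(metrics(*confusion_counts(toy_scores, toy_labels, 0.008)))
```

Because the test set holds ten negatives for every positive, accuracy is dominated by the negatives, so it is worth reading it alongside the F-measure, as Table 2 does.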

