Prediction of protein domain with mRMR feature selection and analysis.

Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC - PLoS ONE (2012)

Bottom Line: With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28-40% higher than those achieved by existing methods on the same benchmark dataset. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.


Affiliation: Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China.

ABSTRACT
Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28-40% higher than those achieved by existing methods on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, which is consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with a high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.

Figure 1. A plot showing the change of the MCC values versus the number of features for different window sizes. The IFS curves were drawn based on the data in Online Supporting Information S3. The MCC value reached its peak when the number of features was 360 and the window size was 13. The 360 features thus obtained were used to form the optimal feature set for the protein domain predictor. The purple line is for the size-13 window, the green line for the size-15 window, and the brown line for the size-17 window. See the text for further explanation.

Mentions: As described in Section 2.9 of Materials and Methods, setting N to 403 and the feature-increasing gap to 5 yielded 80 individual predictors, one for each of 80 feature subsets, for predicting the protein domain sites in the sequence samples generated by the size-13 sliding window. Online Supporting Information S3 lists the prediction accuracy, specificity, sensitivity and MCC (cf. 14) obtained by each of the 80 predictors, and Fig. 1 shows the IFS curve plotted from these data. The same calculations were carried out for the size-15 and size-17 windows, and the results are also plotted in Fig. 1. The predictor based on the size-13 window outperformed the other two, with a maximal MCC of 0.342 when 360 features were included. These 360 features were therefore taken as the optimal feature set of our classifier. With this classifier, the prediction sensitivity, specificity and accuracy were 0.577, 0.768 and 0.704, respectively (Table 1). The optimal 360 features are given in Online Supporting Information S4. After applying the IFS procedure (cf. Sections 2.9 and 2.10 of Materials and Methods), we obtained the 195 final optimal features given in Online Supporting Information S5.
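The arithmetic behind the 80 predictors and the reported metrics can be illustrated with a minimal Python sketch. It assumes that the IFS subsets are nested prefixes of the mRMR-ranked feature list whose sizes grow by the gap (5, 10, 15, ...), and it uses the standard textbook definitions of sensitivity, specificity, accuracy and MCC. The function names are hypothetical and chosen for illustration; this is not the authors' code.

```python
import math


def ifs_subset_sizes(n_features: int, gap: int) -> list[int]:
    """Sizes of the nested feature subsets: gap, 2*gap, ... (assumed scheme)."""
    return list(range(gap, n_features + 1, gap))


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def basic_rates(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Sensitivity, specificity and accuracy (assumes non-empty classes)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def best_subset_size(mcc_by_size: dict[int, float]) -> int:
    """Pick the subset size whose predictor has the highest MCC (the IFS peak)."""
    return max(mcc_by_size, key=mcc_by_size.get)


sizes = ifs_subset_sizes(403, 5)
print(len(sizes), sizes[0], sizes[-1])
```

With N = 403 and a gap of 5, this scheme gives subset sizes from 5 to 400, which matches the 80 predictors mentioned in the text, and 360 is one of those sizes. In practice each subset would be used to train one classifier, and `best_subset_size` would be applied to the resulting MCC values to locate the peak of the IFS curve.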

