Prediction of protein domain with mRMR feature selection and analysis.

Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC - PLoS ONE (2012)

Bottom Line: With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate protein structure prediction and speed up functional annotation. The overall success rate achieved by the new method on an independent dataset was around 73%, about 28-40% higher than the rates achieved by the existing method on the same benchmark dataset. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.


Affiliation: Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China.

ABSTRACT
Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate protein structure prediction and speed up functional annotation. However, although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, about 28-40% higher than the rates achieved by the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, which is quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
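To make the feature-selection pipeline named in the abstract more concrete, the following is a minimal sketch of a greedy mRMR ranking followed by IFS, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: features are assumed to be discrete, the mRMR criterion is assumed to be the common difference form (relevance minus mean redundancy), and `evaluate` is a hypothetical caller-supplied scoring function standing in for the cross-validated performance of a classifier such as a random forest. The toy data and feature names are invented for the example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), expressed with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def mrmr_rank(features, target):
    """Greedy mRMR ranking (difference form, assumed here).

    features: dict mapping feature name -> list of discrete values
    target:   list of class labels, same length as each feature
    """
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def incremental_feature_selection(ranked, evaluate):
    """Score every prefix of the ranked list and return the best one.

    evaluate is a hypothetical callback: it takes a list of feature names
    and returns a number where higher is better.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        current = evaluate(ranked[:k])
        if current > best_score:
            best_subset, best_score = ranked[:k], current
    return best_subset, best_score


# Toy data (invented): f_b is nearly a copy of f_a, f_c carries different information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "f_b": [0, 0, 1, 1, 1, 1, 1, 1],
    "f_c": [1, 0, 0, 0, 1, 1, 1, 0],
}
ranked = mrmr_rank(features, target)

# Stand-in scorer: pretends that two features give the best performance.
toy_scores = {1: 0.6, 2: 0.8, 3: 0.7}
subset, score = incremental_feature_selection(ranked, lambda fs: toy_scores[len(fs)])
```

In this toy case `f_b` is more relevant to the target than `f_c` when considered alone, but it is ranked last because it is largely redundant with the already selected `f_a`. That trade-off is the point of the mRMR criterion.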


pone-0039308-g003: A 2-dimensional histogram characterizing the PSSM features in the final optimal feature set. (A) The impact on domain prediction of mutation to each of the 20 amino acid types. (B) The evolutionary conservation status for each of the 13 subsites. See the text for further explanation.

Mentions: As mentioned above, among the 195 optimal features, 147 belonged to the PSSM conservation features and hence made up the highest proportion. It can be clearly seen from Fig. 3A that each of the 20 amino acid types has a different PSSM conservation impact in determining the protein domain site. In this regard, the amino acid N (asparagine) or D (aspartic acid) has the highest impact, followed successively by G (glycine), R (arginine), and so forth. Interestingly, it has been reported that D, G and R are over-represented in protein interaction domains [98]. In addition, G is believed to be instrumental in defining the core domain and inter-domain regions of a protein [39]. Meanwhile, as shown in Fig. 3B, for the samples generated by the size-13 window (cf. Eq. 2), the conservation status at subsite 10 played the most important role in predicting the protein domain site, followed by subsites 1, 2, 4, and 7. Furthermore, of the top ten features in the final optimal feature list, five were PSSM conservation features. The first was the conservation status against residue M (methionine) at subsite 1 (index 3, “AA1_pssm_13”). The other four were the conservation status against residue A (alanine) at subsite 12 (index 4, “AA12_pssm_1”), the conservation status against residue G at subsites 6 and 4 (index 6 and index 7, “AA6_pssm_8” and “AA4_pssm_8”), and the conservation status against residue T (threonine) at subsite 2 (index 8, “AA2_pssm_17”).
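The feature names quoted above appear to follow the pattern “AA<subsite>_pssm_<column>”. The sketch below decodes them and tallies the features per residue and per subsite, which is one plausible way to build counts of the kind summarized in Fig. 3. It assumes that the PSSM columns follow the standard PSI-BLAST residue order (ARNDCQEGHILKMFPSTWYV). That assumption agrees with the four residue assignments given in the text (13 = M, 1 = A, 8 = G, 17 = T) but is not stated explicitly there. The function name and the range checks are illustrative.

```python
import re
from collections import Counter

# Assumption: PSSM columns follow the standard PSI-BLAST residue order.
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"
WINDOW_SIZE = 13  # size-13 sliding window, per the text

_FEATURE_NAME = re.compile(r"AA(\d+)_pssm_(\d+)")


def decode_pssm_feature(name):
    """Map a name such as 'AA1_pssm_13' to (subsite, one-letter residue)."""
    match = _FEATURE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a PSSM feature name: {name!r}")
    subsite, column = int(match.group(1)), int(match.group(2))
    if not 1 <= subsite <= WINDOW_SIZE:
        raise ValueError(f"subsite out of range: {subsite}")
    if not 1 <= column <= len(PSSM_ORDER):
        raise ValueError(f"PSSM column out of range: {column}")
    return subsite, PSSM_ORDER[column - 1]


# The five PSSM features among the top ten, as listed in the text.
top_pssm = ["AA1_pssm_13", "AA12_pssm_1", "AA6_pssm_8", "AA4_pssm_8", "AA2_pssm_17"]
decoded = [decode_pssm_feature(name) for name in top_pssm]

# Tallies per residue and per subsite.
by_residue = Counter(residue for _, residue in decoded)
by_subsite = Counter(subsite for subsite, _ in decoded)
```

Applied to all 147 selected PSSM features rather than these five, the same two counters would give per-residue and per-subsite totals comparable to the two panels of the figure, provided the figure is based on simple counts.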

