An ensemble method with hybrid features to identify extracellular matrix proteins.

Yang R, Zhang C, Gao R, Zhang L - PLoS ONE (2015)

Bottom Line: Moreover, when tested on a common independent dataset, our method also achieves significantly improved performance over ECMPP and ECMPRED. These results indicate that IECMP is an effective method for ECM protein prediction, which has a more balanced prediction capability for positive and negative samples. It is anticipated that the proposed method will provide significant information to fully decipher the molecular mechanisms of ECM-related biological processes and discover candidate drug targets.


Affiliation: School of Control Science and Engineering, Shandong University, Jinan, China.

ABSTRACT
The extracellular matrix (ECM) is a dynamic composite of secreted proteins that play important roles in numerous biological processes such as tissue morphogenesis, differentiation and homeostasis. Furthermore, various diseases are caused by the dysfunction of ECM proteins. Therefore, identifying these important ECM proteins may assist in understanding related biological processes and in drug development. In view of the serious imbalance in the training dataset, a Random Forest-based ensemble method with hybrid features is developed in this paper to identify ECM proteins. The hybrid features incorporate sequence composition, physicochemical properties, and evolutionary and structural information. The Information Gain Ratio and Incremental Feature Selection (IGR-IFS) methods are adopted to select the optimal features. Finally, the resulting predictor, termed IECMP (Identify ECM Proteins), achieves a balanced accuracy of 86.4% using 10-fold cross-validation on the training dataset, which is much higher than the results obtained by other methods (ECMPRED: 71.0%, ECMPP: 77.8%). Moreover, when tested on a common independent dataset, our method also achieves significantly improved performance over ECMPP and ECMPRED. These results indicate that IECMP is an effective method for ECM protein prediction, which has a more balanced prediction capability for positive and negative samples. It is anticipated that the proposed method will provide significant information to fully decipher the molecular mechanisms of ECM-related biological processes and discover candidate drug targets. For public access, we have developed a user-friendly web server for ECM protein identification that is freely accessible at http://iecmp.weka.cc.
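To illustrate why the abstract reports balanced accuracy rather than plain accuracy on an imbalanced dataset, here is a minimal sketch. It assumes the common definition of balanced accuracy as the mean of sensitivity and specificity; the page does not state the exact formula the authors used, and the function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (assumed definition).

    Labels are 1 for positive (e.g. ECM protein) and 0 for negative.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Hypothetical toy data: 10 positives, 90 negatives, and a trivial
# predictor that labels every sample as negative.
y_true = [1] * 10 + [0] * 90
all_negative = [0] * 100
```

On this toy data the trivial predictor scores well on plain accuracy because negatives dominate, while its balanced accuracy stays at chance level, which is the kind of gap a balanced metric is meant to expose.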



pone.0117804.g004: The feature type distributions in the original and optimal feature set. FFG: Frequencies of Functional Groups, PseAAC: Pseudo Amino Acid Composition, DWT: Discrete Wavelet Transformation, PSSM: Position Specific Scoring Matrix, SSI: Secondary Structural Information, FDI: Functional Domain Information.

Mentions: The four kinds of features in Fig. 3 produce 10 types of feature vectors, as given in Table 1. The feature type distributions in the original and optimal feature sets are illustrated in Fig. 4. It is interesting to note from Fig. 4 that the features from distribution are all retained in the optimal feature set. This may be because these features reflect sequence-order information, which is known to capture important properties for protein attribute prediction [23–25]. Furthermore, a much higher percentage of the features from functional domain information (15 out of 17) is selected from the original feature set. This finding is consistent with previous studies, which indicated that the key role of the ECM in governing protein interactions is attributed to highly conserved domains of ECM proteins [63]. Although slightly less relevant, the other eight types of features also contribute to the identification of ECM proteins. The prediction model integrates multiple sources of descriptors for protein sequences in an attempt to enhance prediction performance.
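According to the abstract, the optimal feature set discussed above is chosen with Information Gain Ratio ranking followed by Incremental Feature Selection (IGR-IFS). The sketch below shows the general shape of that procedure, not the authors' implementation. It assumes features have already been discretized, and it leaves the scoring step to a caller-supplied `evaluate` function (in the paper's setting this would presumably be a cross-validated score of the classifier). All function and variable names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A constant feature has zero split information; score it as 0.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Return feature names sorted by decreasing gain ratio."""
    scores = {name: gain_ratio(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def incremental_feature_selection(ranked, evaluate):
    """Grow the feature set one ranked feature at a time; keep the best prefix.

    `evaluate(subset)` must return a score where higher is better.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Because the search only considers prefixes of the ranked list, it needs one evaluation per feature rather than one per subset, at the cost of never revisiting a feature that ranks poorly on its own.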

