Multi-level machine learning prediction of protein-protein interactions in Saccharomyces cerevisiae.

Zubek J, Tatjewski M, Boniecki A, Mnich M, Basu S, Plewczynski D - PeerJ (2015)

Bottom Line: The level-II predictor improves the results further through a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level, cross-validation experiment. Our results demonstrate that the multi-scale sequence feature aggregation procedure improves the machine learning results by more than 10% compared to other sequence representations.


Affiliation: Centre of New Technologies, University of Warsaw, Warsaw, Poland; Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland.

ABSTRACT
Accurate identification of protein-protein interactions (PPI) is a key step in understanding proteins' biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for the prediction of protein-protein interactions. We start with carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB). First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Second, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein-protein interaction is made using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level, prediction. The level-II predictor improves the results further through a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level, cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below the 0.6 threshold (recall, precision, AUC). Our results demonstrate that the multi-scale sequence feature aggregation procedure improves the machine learning results by more than 10% compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from http://zubekj.github.io/mlppi/ (open-source Python implementation, OS independent).
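
For illustration only, the all-against-all pairing idea behind the 2D matrix representation can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the segment length and the level-I scoring function below are placeholder assumptions, and the actual pipeline is available at http://zubekj.github.io/mlppi/.

    # Illustrative sketch only -- not the authors' code. The segment length and
    # the level-I scorer are placeholders used to show the matrix construction.
    import numpy as np

    SEGMENT_LENGTH = 5  # assumed window size, for illustration only

    def segments(sequence, length=SEGMENT_LENGTH):
        """Cut a protein sequence into overlapping fixed-length fragments."""
        return [sequence[i:i + length]
                for i in range(len(sequence) - length + 1)]

    def level1_score(segment_a, segment_b):
        """Placeholder for the level-I segment-interaction classifier.
        Here: a dummy score based on shared residue types, purely illustrative."""
        return len(set(segment_a) & set(segment_b)) / SEGMENT_LENGTH

    def interaction_matrix(protein_a, protein_b):
        """Build the 2D all-against-all matrix of segment-pair scores.
        A level-II predictor would aggregate features from this matrix."""
        segs_a, segs_b = segments(protein_a), segments(protein_b)
        matrix = np.zeros((len(segs_a), len(segs_b)))
        for i, sa in enumerate(segs_a):
            for j, sb in enumerate(segs_b):
                matrix[i, j] = level1_score(sa, sb)
        return matrix

    # Example usage with toy sequences
    m = interaction_matrix("MKTAYIAKQR", "GSHMLEDPVK")
    print(m.shape, m.max())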

Fig. 2: Schema of the train-test split for evaluation of the trained classifiers. Numbers of examples used in each step are given in parentheses.

Mentions: The last step of the data generation procedure is splitting samples between training and testing sets. To evaluate our method in a realistic setup, we split the benchmarking dataset at the protein level, not at the residue level. This made our goal more difficult compared to previous works, which often used residue-level splitting of the benchmarking dataset. A schematic depiction of the train-test split is given in Fig. 2. The level-I classifier was evaluated through a train-test experiment with a relatively large dataset. The level-II classifier was evaluated through a cross-validation experiment that used the testing set only; information from the training set entered only in the form of the trained level-I classifier. This schema eliminated the risk of overoptimistic performance estimates caused by the same data appearing during the training and testing phases.
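
A protein-level split of this kind can be sketched with standard grouped splitting utilities. This is a minimal sketch on synthetic data, not the authors' evaluation code; scikit-learn's grouped splitters stand in here for the paper's 30-fold protocol, and all variable names are assumptions.

    # Illustrative sketch only -- not the authors' evaluation code. Shows how a
    # protein-level (rather than residue-level) split can be enforced by
    # grouping examples by the proteins they come from. Data below is synthetic.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit, GroupKFold

    rng = np.random.default_rng(0)
    n_examples = 1000
    X = rng.normal(size=(n_examples, 20))                # segment-pair feature vectors
    y = rng.integers(0, 2, size=n_examples)              # interacting / non-interacting
    protein_ids = rng.integers(0, 60, size=n_examples)   # source protein of each example

    # Protein-level train/test split for the level-I classifier:
    # no protein contributes examples to both sides.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=protein_ids))
    assert not set(protein_ids[train_idx]) & set(protein_ids[test_idx])

    # Protein-level cross-validation on the held-out part only, mirroring the
    # idea of evaluating the level-II stage on the testing set alone.
    # (The paper uses 30 folds; 5 keeps this toy example small.)
    cv = GroupKFold(n_splits=5)
    for fold_train, fold_test in cv.split(X[test_idx], y[test_idx],
                                          groups=protein_ids[test_idx]):
        pass  # train the level-II model on fold_train, evaluate on fold_test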

