Regression applied to protein binding site prediction and comparison with classification.

Giard J, Ambroise J, Gala JL, Macq B - BMC Bioinformatics (2009)

Bottom Line: We also compared the predictive performance of our patch-based method, using a Multilayer Perceptron, with the performance of three other methods available through a web server. Furthermore, the method presented in this work is flexible because the size of the predicted binding site is adjustable. This adaptability is useful when either the false positive rate or the false negative rate has to be limited.


Affiliation: Communications and Remote Sensing Laboratory, Université Catholique de Louvain, Place du Levant 2, 1348 Louvain-la-Neuve, Belgium. joachim.giard@uclouvain.be

ABSTRACT

Background: Structural genomics centers provide hundreds of protein structures of unknown function. Developing methods that determine a protein's function automatically is therefore imperative. A protein's function can be determined by studying the network of its physical interactions. In this context, identifying a potential binding site between proteins is of primary interest. In the literature, methods for predicting the location of a potential binding site are generally based on classification tools. The aim of this paper is to show that regression tools are more efficient than classification tools for patch-based binding site predictors. For this purpose, we developed a patch-based binding site localization method usable with either regression or classification tools.

Results: We compared the predictive performance of regression tools with that of machine learning classifiers. Using leave-one-out cross-validation, we showed that regression tools provide better predictions than classification tools. Among the regression tools, the Multilayer Perceptron ranked highest in prediction quality. We also compared the predictive performance of our patch-based method, using the Multilayer Perceptron, with the performance of three other methods available through a web server. Our method performed similarly to the other methods.

Conclusion: Regression is more efficient than classification when applied to our binding site localization method. Where possible, using regression instead of classification in other existing binding site predictors will probably improve results. Furthermore, the method presented in this work is flexible because the size of the predicted binding site is adjustable. This adaptability is useful when either the false positive rate or the false negative rate has to be limited.



Figure 4: Correlation coefficients box plot. Box plot of the correlation coefficients for each statistical tool. Each correlation coefficient is computed between the scores of the patches of one protein and their real PPV. Each statistical tool is characterized by 180 correlation coefficients, corresponding to the 180 proteins used for the validation of our method. Coefficients for regressions are shown in blue and coefficients for supervised classification are shown in yellow.
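The per-protein coefficient described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say which correlation coefficient was used, so Pearson's is assumed here, and the names `pearson`, `per_protein_correlations`, `scores_by_protein`, and `ppv_by_protein` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Assumption: the source only says "correlation coefficient";
    Pearson's r is used here purely for illustration.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def per_protein_correlations(scores_by_protein, ppv_by_protein):
    """Return one coefficient per protein for a single statistical tool.

    Both arguments map a protein identifier to a list with one value
    per patch: the predicted score and the real PPV, respectively.
    """
    return {
        protein: pearson(scores, ppv_by_protein[protein])
        for protein, scores in scores_by_protein.items()
    }
```

Applied to each statistical tool in turn, a function like this would yield the 180 coefficients per tool that the box plot summarizes.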

Mentions: After the leave-one-out cross-validation, each statistical tool was associated with 180 correlation coefficients, one per protein. The distributions of these correlation coefficients for regression and classification are compared in the box plot shown in Figure 4. Using regression to fit the relationship between the properties and the PPV frequently led to better score correlations than using classification. Because the correlation coefficients vary considerably from one protein to another for the same statistical tool, the box plots are stretched and overlap substantially. A relative box plot makes it easier to see which statistical tool generally gives the best results. For this purpose, relative correlation coefficients were computed for each protein separately, relative to the mean of the scores over all the statistical tools. A positive relative coefficient means that the statistical tool works better than the average. The distribution of these relative coefficients is depicted in a box plot in Figure 5. The box plot of the relative correlation coefficients shows that, for any particular protein, regression generally gave better predictions than the average, whereas classification generally gave worse predictions than the average.
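The relative coefficients can be illustrated with a short sketch. It assumes that "relative" means the difference between a tool's coefficient for a protein and the mean coefficient over all tools for that same protein; the text is compatible with this reading (a positive value means better than average) but does not spell out the formula. The function name, the tool names, and the numbers in the usage example are hypothetical.

```python
def relative_coefficients(corr):
    """Center each protein's coefficients on the mean over all tools.

    corr maps tool name -> {protein id -> correlation coefficient}.
    Assumption: the relative coefficient is a difference from the
    per-protein mean, not a ratio.
    """
    tools = list(corr)
    if not tools:
        return {}
    proteins = list(corr[tools[0]])
    relative = {tool: {} for tool in tools}
    for protein in proteins:
        # Mean coefficient for this protein across every statistical tool.
        mean_r = sum(corr[tool][protein] for tool in tools) / len(tools)
        for tool in tools:
            relative[tool][protein] = corr[tool][protein] - mean_r
    return relative


if __name__ == "__main__":
    # Made-up coefficients for two tools and two proteins.
    example = {
        "regression_tool": {"protA": 0.6, "protB": 0.2},
        "classification_tool": {"protA": 0.4, "protB": 0.0},
    }
    for tool, values in relative_coefficients(example).items():
        print(tool, values)
```

Centering per protein removes the protein-to-protein variation that stretches the box plots in Figure 4, so only the differences between tools remain.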

