Accurate and efficient target prediction using a potency-sensitive influence-relevance voter.

Lusci A, Browning M, Fooshee D, Swamidass J, Baldi P - J Cheminform (2015)

Bottom Line: We present two improvements over current practice. First, using a modified version of the influence-relevance voter (IRV), we show that using molecule potency data can improve target prediction. Second, we demonstrate that random inactive molecules added during training can boost the accuracy of several algorithms in realistic target-prediction experiments. Our potency-sensitive version of the IRV (PS-IRV) obtains the best results on large test sets in most of the experiments.


Affiliation: School of Information and Computer Sciences, University of California, Irvine, Irvine, USA.

ABSTRACT

Background: A number of algorithms have been proposed to predict the biological targets of diverse molecules. Some are structure-based, but the most common are ligand-based and use chemical fingerprints and the notion of chemical similarity. These methods tend to be computationally faster than others, making them particularly attractive tools as the amount of available data grows.
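As an illustration of the ligand-based idea described above, the following is a minimal sketch of fingerprint similarity, assuming each fingerprint is stored as a set of "on" bit indices and that similarity is measured with the Tanimoto coefficient. The function names `tanimoto` and `rank_neighbors` and the toy library are hypothetical, not taken from the paper or its software.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints.

    Each fingerprint is assumed to be a set of 'on' bit indices.
    Two empty fingerprints are treated as having similarity 0.0.
    """
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_neighbors(
    query: Set[int], library: Dict[str, Set[int]], k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k library molecules most similar to the query."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical toy library: molecule name -> fingerprint bits.
toy_library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {2, 3, 4, 5},
    "mol_c": {10, 11},
}
top_hits = rank_neighbors({1, 2, 3}, toy_library, k=2)
```

Similarity-based predictors of this kind only need set operations per comparison, which is one reason such methods tend to scale well as the amount of data grows.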

Results: Using a ChEMBL-derived database covering 490,760 molecule-protein interactions and 3236 protein targets, we conduct a large-scale assessment of the performance of several target-prediction algorithms at predicting drug-target activity. We assess algorithm performance using three validation procedures: standard tenfold cross-validation, tenfold cross-validation in a simulated screen that includes random inactive molecules, and validation on an external test set composed of molecules not present in our database.
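The second validation procedure can be sketched roughly as follows. This is only an illustration under stated assumptions: random inactive molecules (called `decoys` here) are labeled inactive and shuffled into the folds together with the known actives and inactives. How the authors actually distribute these molecules across folds is not specified in this abstract, and the function name `tenfold_splits` is hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Labeled = Tuple[object, int]  # (molecule, label) with 1 = active, 0 = inactive


def tenfold_splits(
    actives: Sequence,
    inactives: Sequence,
    decoys: Sequence,
    n_folds: int = 10,
    seed: int = 0,
) -> Iterator[Tuple[List[Labeled], List[Labeled]]]:
    """Yield (train, test) pairs for n-fold cross-validation.

    Assumption: decoys are random molecules treated as inactive and
    mixed into every fold, to mimic a simulated screen.
    """
    rng = random.Random(seed)
    labeled: List[Labeled] = (
        [(m, 1) for m in actives]
        + [(m, 0) for m in inactives]
        + [(m, 0) for m in decoys]
    )
    rng.shuffle(labeled)
    # Deal the shuffled molecules into n_folds roughly equal folds.
    folds = [labeled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Toy usage with integer stand-ins for molecules.
splits = list(tenfold_splits(range(20), range(20, 30), range(100, 150)))
```

Each molecule appears in exactly one test fold, so every prediction is made by a model that did not see that molecule during training.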

Conclusions: We present two improvements over current practice. First, using a modified version of the influence-relevance voter (IRV), we show that using molecule potency data can improve target prediction. Second, we demonstrate that random inactive molecules added during training can boost the accuracy of several algorithms in realistic target-prediction experiments. Our potency-sensitive version of the IRV (PS-IRV) obtains the best results on large test sets in most of the experiments. Models and software are publicly accessible through the chemoinformatics portal at http://chemdb.ics.uci.edu/.



Fig. 1: Distribution of the number of molecules associated with each protein. The x axis bins proteins by the number of small-molecule data-points with which they are associated. The y axis plots the number of proteins in each bin. There are 2108 protein targets. About 33 % of these datasets contain more than 100 molecules.

Mentions: This is similar data to that used in several other studies [6–12]. The data from PubChem is excluded because it often does not include EC50 potency data. Molecules were labeled inactive using three different cutoffs: 1, 5, and concentrations. The entire dataset contains 3236 protein targets (cf. Additional file 1 containing the list of corresponding ChEMBL IDs). However, for 1128 of these protein targets, there are fewer than 10 active molecules. These were discarded for being too sparse to enable proper learning, but also because they cannot be used properly in the tenfold cross-validation experiments described below. That left 2108 protein targets with at least 10 molecules each. For 695 of these proteins, the corresponding datasets contain 100 molecules or more. The distribution of the dataset sizes associated with each protein is shown in Fig. 1.
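The counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text. The helper `filter_targets` and its toy input are hypothetical and simply illustrate the "at least 10 active molecules" rule.

```python
from typing import Dict

# Counts quoted in the text.
total_targets = 3236    # protein targets in the entire dataset
sparse_targets = 1128   # targets with fewer than 10 active molecules
large_targets = 695     # retained targets with 100 molecules or more

retained_targets = total_targets - sparse_targets
fraction_large = large_targets / retained_targets

print(retained_targets)            # 2108
print(f"{fraction_large:.1%}")     # 33.0%


def filter_targets(active_counts: Dict[str, int], min_actives: int = 10) -> Dict[str, int]:
    """Keep only targets with at least `min_actives` active molecules."""
    return {t: n for t, n in active_counts.items() if n >= min_actives}


# Hypothetical toy input: target ID -> number of active molecules.
kept = filter_targets({"T1": 3, "T2": 10, "T3": 250})
```

The result matches both the 2108 retained targets and the "about 33 %" figure given in the caption of Fig. 1.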

