The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening

View Article: PubMed Central - PubMed

ABSTRACT

The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the size of screening databases and the type of molecular representations on the effectiveness of classification. The results obtained for eight protein targets, five machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest), two types of molecular fingerprints (MACCS and CDK FP) and eight screening databases with different numbers of molecules confirmed our previous findings that increases in the ratio of negative to positive training instances greatly influenced most of the investigated parameters of the ML methods in simulated virtual screening experiments. However, the performance of screening was also shown to be highly dependent on the size of the molecular library. Generally, as the size of the screened database increased, the optimal training ratio also increased, and this ratio can be rationalized using the proposed cost-effectiveness threshold approach. To increase the performance of machine learning-based virtual screening, the training set should be constructed in a way that considers the size of the screening database.
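The dependence on library size can be illustrated with simple arithmetic. The sketch below is not the authors' cost-effectiveness threshold approach; it assumes a hypothetical classifier with a fixed true-positive rate and false-positive rate, and a fixed number of actives hidden in libraries of increasing size. All numbers and the function name `expected_precision` are illustrative assumptions.

```python
def expected_precision(n_actives, library_size, tpr, fpr):
    """Expected precision of a screen, assuming fixed TPR and FPR.

    n_actives    -- number of true actives in the library (assumed)
    library_size -- total number of screened compounds
    tpr, fpr     -- hypothetical true/false positive rates of the model
    """
    n_inactives = library_size - n_actives
    tp = tpr * n_actives      # expected true positives
    fp = fpr * n_inactives    # expected false positives
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical scenario: 100 actives, TPR = 0.8, FPR = 0.01.
for size in (5_000, 50_000, 400_000):
    print(size, round(expected_precision(100, size, 0.8, 0.01), 3))
```

Under these assumptions, the same model returns a much larger share of false positives in a large library, which is consistent with the observation that a higher negative-to-positive training ratio becomes worthwhile as the screened database grows.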



pone.0175410.g001: Dependence of machine learning-based virtual screening performance on the size of the negative training set for two types of fingerprints (panel A: CDK FP; panel B: MACCS FP), averaged over 10 independent trials. The colored lines denote the evaluated parameter (blue: recall; red: precision; magenta: MCC; green: PR plot).

Mentions: The results obtained for 5-HT1AR are presented in Fig 1 (panel A for CDK FP and B for MACCS FP), showing recall, precision, MCC and PR plots for five ML methods and eight screening libraries of different sizes (5000–400,000 compounds); data for the remaining protein targets are available in the Supporting Information (S1 Fig). Each plot shows the average value (over 10 iterations) of a given performance measure for each combination of IN/A training ratio and screening database.
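For reference, recall, precision and MCC can be computed from confusion-matrix counts and then averaged over repeated trials. The following is a minimal sketch using the standard definitions of these measures; the function names and the example counts are hypothetical and are not taken from the article.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return recall, precision and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "mcc": mcc}


def average_metrics(trials):
    """Average each measure over a list of (tp, fp, tn, fn) tuples."""
    results = [screening_metrics(*counts) for counts in trials]
    return {
        name: sum(r[name] for r in results) / len(results)
        for name in ("recall", "precision", "mcc")
    }


# Hypothetical counts from two repeated trials of one screening run.
example_trials = [(80, 20, 880, 20), (70, 30, 870, 30)]
print(average_metrics(example_trials))
```

MCC is included alongside recall and precision because it uses all four cells of the confusion matrix, which makes it less sensitive to the strong class imbalance typical of screening libraries.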

