How to find simple and accurate rules for viral protease cleavage specificities.

Rögnvaldsson T, Etchells TA, You L, Garwicz D, Jarman I, Lisboa PJ - BMC Bioinformatics (2009)

Bottom Line: However, the hitherto proposed methods for extracting rules have been neither easy to understand nor very accurate. The method is fast to converge and yields accurate rules, on par with previous results for HIV-1 protease and better than previous state-of-the-art for HCV NS3/4A protease. A rule extraction methodology by searching for multivariate low-order predicates yields results that significantly outperform existing rule bases on out-of-sample data and are also more transparent to expert users.


Affiliation: Embedded and Intelligent Systems, Halmstad University, Halmstad, Sweden. denni@ide.hh.se

ABSTRACT

Background: Proteases of human pathogens are becoming increasingly important drug targets, hence it is necessary to understand their substrate specificity and to interpret this knowledge in practically useful ways. New methods are being developed that produce large amounts of cleavage information for individual proteases and some have been applied to extract cleavage rules from data. However, the hitherto proposed methods for extracting rules have been neither easy to understand nor very accurate. To be practically useful, cleavage rules should be accurate, compact, and expressed in an easily understandable way.

Results: A new method is presented for producing cleavage rules for viral proteases with seemingly complex cleavage profiles. The method is based on orthogonal search-based rule extraction (OSRE) combined with spectral clustering. It is demonstrated on substrate data sets for human immunodeficiency virus type 1 (HIV-1) protease and hepatitis C virus (HCV) NS3/4A protease, showing excellent prediction performance for both HIV-1 cleavage and HCV NS3/4A cleavage, and agreeing with observed HCV genotype differences. New cleavage rules (consensus sequences) are suggested for HIV-1 and HCV NS3/4A cleavages. The practical usability of the method is also demonstrated by using it to predict the location of an internal cleavage site in the HCV NS3 protease and to correct the location of a previously reported internal cleavage site in the same protease. The method is fast to converge and yields accurate rules, on par with previous results for HIV-1 protease and better than previous state-of-the-art for HCV NS3/4A protease. Moreover, the rules are fewer and simpler than those previously obtained with rule extraction methods.

Conclusion: A rule extraction methodology by searching for multivariate low-order predicates yields results that significantly outperform existing rule bases on out-of-sample data and are also more transparent to expert users. The approach yields rules that are easy to use and useful for interpreting experimental data.



Figure 2: OSRE rules' performance on the HIV-1 PR 1625 data set. The x-axis shows the number of rules used in the prediction. The y-axis shows the CV accuracy (%). The error bars are 1.96 times the binomial standard deviations. The horizontal lines show the accuracy when all rules (even more than 5) are used. The accuracy for zero rules is the default accuracy, obtained when all peptides are classified as the majority class.
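The error bars described in the caption can be reproduced from an accuracy and a sample size. The following is a minimal sketch, assuming the binomial standard deviation is taken as sqrt(p(1 - p)/n), with p the CV accuracy as a fraction and n the number of peptides in the data set; the function name `binomial_error_bar` is hypothetical, and the choice of n is an assumption rather than something the source states. The two accuracies used in the example are the all-rules values reported for the 746 and 1625 peptide sets.

```python
import math


def binomial_error_bar(accuracy: float, n: int, z: float = 1.96) -> float:
    """Return the error bar half-width: z times the binomial standard deviation.

    accuracy is a fraction in [0, 1]; n is the number of predictions
    (assumed here to be the number of peptides in the data set).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# All-rules CV accuracies reported for the two HIV-1 PR data sets.
for name, acc, n in [("746 peptides", 0.87, 746), ("1625 peptides", 0.93, 1625)]:
    half_width = binomial_error_bar(acc, n)
    print(f"{name}: {100 * acc:.0f}% +/- {100 * half_width:.1f}%")
```

Under these assumptions the smaller data set gets roughly twice as wide an error bar as the larger one, which is worth keeping in mind when comparing the two figures.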

Mentions: Rules for the HIV-1 PR 746 and HIV-1 PR 1625 data sets were extracted using OSRE, as described in the Methods section. OSRE produced slightly different numbers of rules for each cross-validation (CV) subset, varying between 7 and 10 rules for the 746 peptide data set and between 6 and 9 rules for the 1625 peptide data set. The CV generalization error was estimated for the rules when one, two, three, four, five and all rules were used to predict cleavage for the hold-out CV data set (rules were ordered in priority order by OSRE); this CV error is shown in Figure 1 and Figure 2. The 746 peptide set is a bit more difficult to predict because it is a balanced data set with fewer negative examples than the 1625 peptide set. The CV performance improves until five rules are used (this was one motivation for using five clusters in the rule clustering). The CV prediction accuracies of the OSRE method when using all rules are 87% for the 746 peptide data set and 93% for the 1625 peptide set.
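The evaluation described above, scoring the hold-out data with only the first k rules of a priority-ordered list, can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration: the rule representation (a conjunction of allowed residues at given positions of an 8-residue peptide), the two rules, and the toy peptides and labels are invented and are not OSRE output or data from the paper. A peptide is predicted as cleaved if any of the first k rules fires; otherwise it receives the majority class, so k = 0 gives the default accuracy.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical rule format: 0-based peptide position -> set of allowed residues.
# A rule fires only if every listed position matches (a conjunction).
Rule = Dict[int, Set[str]]


def rule_fires(rule: Rule, peptide: str) -> bool:
    return all(peptide[pos] in allowed for pos, allowed in rule.items())


def predict(rules: List[Rule], k: int, peptide: str, default: bool) -> bool:
    """Predict 'cleaved' if any of the first k rules fires, else the default class."""
    if any(rule_fires(rule, peptide) for rule in rules[:k]):
        return True
    return default


def majority_label(data: List[Tuple[str, bool]]) -> bool:
    return Counter(label for _, label in data).most_common(1)[0][0]


def accuracy(rules: List[Rule], k: int, data: List[Tuple[str, bool]], default: bool) -> float:
    hits = sum(predict(rules, k, pep, default) == label for pep, label in data)
    return hits / len(data)


# Invented rules in priority order (index 3 = P1, index 4 = P1', index 5 = P2').
RULES: List[Rule] = [
    {3: {"F", "L", "Y"}, 4: {"P"}},
    {3: {"L", "M"}, 5: {"E"}},
]

# Invented toy peptides with labels (True = cleaved); not real measurements.
DATA: List[Tuple[str, bool]] = [
    ("AAAYPAAA", True),
    ("AAALAEAA", True),
    ("GGGGGGGG", False),
    ("AAAAAAAA", False),
    ("KKKLAKKK", False),
    ("TTTTTTTT", False),
    ("DDDMDEDD", False),
]

DEFAULT = majority_label(DATA)

for k in range(len(RULES) + 1):
    print(f"rules used: {k}, accuracy: {accuracy(RULES, k, DATA, DEFAULT):.3f}")
```

In a real cross-validation run the rules would be extracted on the training folds only and the accuracy computed on the held-out fold, then averaged over folds; the sketch shows only the scoring step.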

