Is EC class predictable from reaction mechanism?

Nath N, Mitchell JB - BMC Bioinformatics (2012)

Bottom Line: SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification.


Affiliation: Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland KY16 9ST, UK.

ABSTRACT

Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and also an external test set are used.

Results: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones.

Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.


Figure 1: Linear separating hyperplane for the binary classification. The solid line shows the maximum margin hyperplane separating the red and green classes. The dotted lines show the margins and the highlighted points are the support vectors.

Mentions: SVM is a learning method based on statistical learning theory. The SVM maps the inputs into a high-dimensional feature space through a chosen kernel function. The dimensionality of the feature space is determined by the number of support vectors extracted from the training data. In principle, we seek the optimal separating hyperplane between the two classes by maximizing the margin between the closest points (support vectors) (Figure 1). Noble has explained the essence of SVM with respect to four basic concepts [46]: (1) The separating hyperplane, (2) The maximum-margin hyperplane, (3) The soft margin and (4) The kernel function. SVM is used to solve the task of assigning objects to classes.
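The maximum-margin idea in the paragraph above can be illustrated with a minimal sketch. It is not the authors' code and does not train an SVM. It only compares two hand-picked separating lines on a made-up 2-D data set and reports which points lie closest to each. The data, the two candidate hyperplanes, and all function names are assumptions chosen for illustration. Candidate A was worked out by hand as the widest-margin line for these six points.

```python
import math

# Hypothetical toy data: label -1 stands for the "red" class, +1 for "green".
POINTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
LABELS = [-1, -1, -1, 1, 1, 1]


def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b


def functional_margins(w, b, points, labels):
    """y * (w.x + b) for every point; positive means correctly classified."""
    return [y * decision_value(w, b, x) for x, y in zip(points, labels)]


def geometric_margin(w, b, points, labels):
    """Distance from the hyperplane to the closest point (negative if any
    point is on the wrong side)."""
    norm = math.hypot(w[0], w[1])
    return min(functional_margins(w, b, points, labels)) / norm


def closest_points(w, b, points, labels, tol=1e-9):
    """Points attaining the smallest functional margin. For the
    maximum-margin hyperplane these are the support vectors."""
    margins = functional_margins(w, b, points, labels)
    smallest = min(margins)
    return [x for x, m in zip(points, margins) if m - smallest <= tol]


# Two hyperplanes (w, b) that both separate the toy classes.
CANDIDATE_A = ((0.4, 0.4), -1.4)  # the line x1 + x2 = 3.5
CANDIDATE_B = ((1.0, 0.0), -2.0)  # the line x1 = 2

for name, (w, b) in [("A", CANDIDATE_A), ("B", CANDIDATE_B)]:
    print(name,
          "margin:", round(geometric_margin(w, b, POINTS, LABELS), 4),
          "closest:", closest_points(w, b, POINTS, LABELS))
```

Both lines classify every toy point correctly, but A keeps the classes further from the boundary, which is the criterion an SVM optimises. A real SVM would find this hyperplane by solving an optimisation problem, with a soft margin and a kernel function where the classes are not linearly separable.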

