Towards structured output prediction of enzyme function.

Astikainen K, Holm L, Pitkänen E, Szedmak S, Rousu J - BMC Proc (2008)

Bottom Line: A polynomial kernel over the GTG feature set turned out to be a prerequisite for accurate function prediction. Combining GTG with string kernels boosted accuracy slightly in the case of EC class prediction. Structured output prediction with GTG features is shown to be computationally feasible and to have accuracy on par with state-of-the-art approaches in enzyme function prediction.


Affiliation: Department of Computer Science, PO Box 68, FI-00014 University of Helsinki, Finland. astikain@cs.helsinki.fi

ABSTRACT

Background: In this paper we describe work in progress on developing kernel methods for enzyme function prediction. Our focus is on developing so-called structured output prediction methods, in which the enzymatic reaction is the combinatorial target object of prediction. We compared two structured output prediction methods, the Hierarchical Max-Margin Markov algorithm (HM3) and the Maximum Margin Regression algorithm (MMR), in hierarchical classification of enzyme function. As sequence features we use various string kernels and the GTG feature set derived from the global alignment trace graph of protein sequences.
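The string and GTG kernels enter these methods only through inner products between sequence feature vectors. A minimal sketch of a polynomial kernel over such feature vectors (the feature matrix below is a hypothetical stand-in, not actual GTG data; the function name is illustrative):

```python
import numpy as np

def polynomial_kernel(X, Z, degree=3, coef0=1.0):
    """Polynomial kernel K(x, z) = (x . z + coef0)^degree between row vectors."""
    return (X @ Z.T + coef0) ** degree

# Toy feature matrix standing in for GTG feature vectors (hypothetical data).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
K = polynomial_kernel(X, X, degree=2)  # 2x2 symmetric kernel matrix
```

Any kernel machine that consumes only the Gram matrix K can then combine such a kernel with string kernels by, for example, summing the matrices.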

Results: In our experiments on predicting enzyme EC classification, we obtain over 85% accuracy (predicting the four-digit EC code) and over 91% microlabel F1 score (predicting individual EC digits). In predicting the Gold Standard enzyme families, we obtain over 79% accuracy (predicting the family correctly) and over 89% microlabel F1 score (predicting superfamilies and families). In the latter case, the structured output methods are significantly more accurate than the nearest-neighbor classifier. A polynomial kernel over the GTG feature set turned out to be a prerequisite for accurate function prediction. Combining GTG with string kernels boosted accuracy slightly in the case of EC class prediction.
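The microlabel F1 score pools true and false positives over the individual label positions (EC digits or hierarchy nodes) of all test examples. A minimal sketch, assuming a binary 0/1 microlabel encoding (the encoding and function name are illustrative, not the authors' evaluation code):

```python
def microlabel_f1(Y_true, Y_pred):
    """Micro-averaged F1 over binary microlabels pooled across all examples.

    Y_true, Y_pred: sequences of equal-length 0/1 tuples,
    one bit per microlabel (e.g. per hierarchy node).
    """
    tp = fp = fn = 0
    for yt, yp in zip(Y_true, Y_pred):
        for t, p in zip(yt, yp):
            tp += t and p            # predicted 1, truly 1
            fp += (not t) and p      # predicted 1, truly 0
            fn += t and (not p)      # predicted 0, truly 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 microlabels per example, one bit missed in the first example.
f1 = microlabel_f1([(1, 1, 0), (1, 0, 1)], [(1, 0, 0), (1, 0, 1)])
```

Micro-averaging rewards getting individual digits right even when the full four-digit code is wrong, which is why the microlabel F1 figures above exceed the exact-match accuracies.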

Conclusion: Structured output prediction with GTG features is shown to be computationally feasible and to have accuracy on par with state-of-the-art approaches in enzyme function prediction.




Figure 3: Effect of training set size and nearest-neighbor distance on MMR accuracy. The left panel shows the MMR Gold Standard classification F1 score as the training set size m and the kernel value between a test example and its nearest training set neighbor increase. The right panel shows the same information for the EC dataset. Values in parentheses are the average numbers of test examples in each training set size and kernel value interval.

Mentions: Figure 3 depicts a heat map of the MMR microlabel F1 score on the GS (left) and EC (right) datasets as the training set size is varied. The test set is divided along the vertical axis by the kernel value of the closest sequence neighbor in the training set. On the GS data, predictive accuracy improves with increasing training set size, and the improvement is clearer for enzymes whose closest neighbor is at medium distance. On the EC dataset, the microlabel F1 is high even with a small training set, and increasing the training set size affects the accuracy only slightly, except for enzymes with only distant neighbors. Figure 4 depicts a heat map of the difference in microlabel F1 score between MMR and NN. Red indicates a small difference; yellow indicates a larger difference in favor of MMR. On the GS dataset (Figure 4, left), both of the initial expectations turned out to be wrong: MMR outperforms NN more clearly the larger the training set and the closer the nearest sequence neighbors are. On the EC dataset (Figure 4, right), almost no difference is seen, irrespective of training set size or the closeness of sequence neighbors.
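The vertical binning described above assigns each test example to a stratum by its largest kernel value against the training set, i.e. its nearest neighbor in kernel space. A minimal sketch (the function name, kernel values, and bin edges are illustrative):

```python
import numpy as np

def bin_by_nearest_neighbor(K_test_train, edges):
    """Assign each test example a bin index from the largest kernel value
    it attains against any training example (its nearest kernel-space neighbor).

    K_test_train: (n_test, n_train) kernel matrix between test and training sets.
    edges: increasing bin edges over kernel values.
    """
    nn_values = K_test_train.max(axis=1)   # nearest-neighbor kernel value per test example
    return np.digitize(nn_values, edges)   # bin index per test example

# Toy kernel matrix: 2 test examples vs. 2 training examples.
bins = bin_by_nearest_neighbor(np.array([[0.1, 0.9],
                                         [0.3, 0.2]]),
                               edges=[0.25, 0.5, 0.75])
```

Averaging the microlabel F1 within each (bin, training set size) cell yields a heat map of the kind shown in Figures 3 and 4.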

