Predicting pathway membership via domain signatures.

Fröhlich H, Fellmann M, Sültmann H, Poustka A, Beissbarth T - Bioinformatics (2008)

Bottom Line: In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database. Moreover, for signaling pathways we reveal that it is even possible to forecast accurately the membership to individual pathway components. The R package gene2pathway is a supplement to this article.


Affiliation: German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany. h.froehlich@dkfz-heidelberg.de

ABSTRACT

Motivation: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases such as KEGG. However, only a small fraction of genes has been annotated with pathway information so far. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database.

Results: We present a classification model that, for a specific gene of interest, can predict the mapping to a KEGG pathway based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to different pathways at the same time. The classification method produces a scoring of all possible mapping positions of the gene in the KEGG hierarchy. Evaluations of our model, which combines an SVM with a ranking perceptron approach, show a high prediction performance. Moreover, for signaling pathways we reveal that it is even possible to forecast accurately the membership to individual pathway components.

Availability: The R package gene2pathway is a supplement to this article.


Figure 1: Prediction performance of our method (10×10-fold cross-validation). The accuracy measure uses the same loss function that was used to train the classifier, which takes the KEGG hierarchy into account. (A) Pathway prediction within the pruned KEGG hierarchy (53 branches). (B) Pathway component prediction for signaling pathways (19 branches).
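As a rough illustration of the evaluation protocol named in the caption, the sketch below generates index splits for a repeated k-fold cross-validation. It assumes that "10×10-fold" means 10 independent repetitions of 10-fold cross-validation with reshuffling between repetitions; the function name `repeated_kfold`, its parameters, and the seed are hypothetical, and the source does not say how the authors actually partitioned the data.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_samples: int, n_folds: int = 10, n_repeats: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # Reshuffle the sample indices for every repetition (an assumption).
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds nearly equal-sized folds.
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield train, test


if __name__ == "__main__":
    splits = list(repeated_kfold(n_samples=25))
    print(len(splits))  # 10 repetitions x 10 folds = 100 train/test splits
```

Averaging a hierarchy-aware accuracy over all splits would then give one summary value per experiment; how the authors aggregated their results is not stated in this excerpt.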

Mentions: Given the position-labeled dataset D, we employ the modified ranking perceptron algorithm presented by Melvin et al. (2007) to learn a weight vector w for the input code vectors. In the spirit of SVM classifiers, the weight vector is optimized to maximize the margin between position code vectors Ci, Cj with Ci≠Cj in input code vector space. The algorithm shown in Figure 1 involves updating w in proportion to the loss we incur by predicting a wrong position vector Cj instead of the true position vector Ci. The choice of this loss function is the essential part of the algorithm, because it reflects our knowledge about the KEGG hierarchy. Making a wrong prediction at the higher levels of the hierarchy should be punished more than confusing individual KEGG pathways at the bottom level. We therefore set up the loss function given in Equation (3), where Anc(j) denotes the set of all ancestors of branch j and 1 is the indicator function. With this loss function we punish the first mismatch on the path down the hierarchy to the final predicted position. The higher in the hierarchy the mismatch occurs, the higher the punishment ci should be. We thus choose ci as in Equation (4), where |T(i)| denotes the size of the hierarchy beneath branch i and |T(root)| is the size of the complete KEGG hierarchy.
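The displayed forms of Equations (3) and (4) are not reproduced in this record, so the following is only a minimal sketch of a loss that matches the verbal description. It assumes that the loss sums a cost c_k over every branch k where the true and predicted position codes differ while all ancestors of k agree, and that c_k = |T(k)| / |T(root)|, with |T(k)| counted as the number of branches in the subtree rooted at k (including k itself). The toy hierarchy, branch names, and function names are all hypothetical and are not KEGG identifiers or part of gene2pathway.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical toy hierarchy (child -> parent); not real KEGG identifiers.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "metabolism": "root",
    "signaling": "root",
    "glycolysis": "metabolism",
    "tca_cycle": "metabolism",
    "mapk": "signaling",
}


def ancestors(node: str) -> List[str]:
    """All ancestors of a branch, from its parent up to the root."""
    out = []
    parent = PARENT[node]
    while parent is not None:
        out.append(parent)
        parent = PARENT[parent]
    return out


def subtree_size(node: str) -> int:
    """Assumed |T(node)|: number of branches in the subtree, node included."""
    return 1 + sum(subtree_size(c) for c, p in PARENT.items() if p == node)


def cost(node: str) -> float:
    """Assumed form of Equation (4): c = |T(node)| / |T(root)|."""
    return subtree_size(node) / subtree_size("root")


def position_code(branches: Iterable[str]) -> Dict[str, int]:
    """Binary code: 1 for every mapped branch and each of its ancestors."""
    active = set()
    for branch in branches:
        active.add(branch)
        active.update(ancestors(branch))
    return {node: int(node in active) for node in PARENT}


def hierarchical_loss(c_true: Dict[str, int], c_pred: Dict[str, int]) -> float:
    """Assumed form of Equation (3): charge only the first mismatch per path."""
    total = 0.0
    for node in PARENT:
        mismatch = c_true[node] != c_pred[node]
        ancestors_agree = all(c_true[a] == c_pred[a] for a in ancestors(node))
        if mismatch and ancestors_agree:
            total += cost(node)
    return total


if __name__ == "__main__":
    truth = position_code(["glycolysis"])
    sibling = position_code(["tca_cycle"])  # wrong only at the bottom level
    distant = position_code(["mapk"])       # wrong already at the top level
    print(round(hierarchical_loss(truth, sibling), 3))  # 0.333
    print(round(hierarchical_loss(truth, distant), 3))  # 0.833
```

In this toy setting, confusing two sibling pathways costs less than predicting a pathway in the wrong top-level branch, which is the behavior the text asks of the loss. A perceptron update would then scale its step by this value, though the exact update rule of Melvin et al. (2007) is not shown here.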

