Predicting pathway membership via domain signatures.

Fröhlich H, Fellmann M, Sültmann H, Poustka A, Beissbarth T - Bioinformatics (2008)

Bottom Line: In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database. Moreover, for signaling pathways we reveal that it is even possible to forecast accurately the membership to individual pathway components. The R package gene2pathway is a supplement to this article.

Affiliation: German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany. h.froehlich@dkfz-heidelberg.de

ABSTRACT

Motivation: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases, like KEGG. However, only a small fraction of genes is annotated with pathway information up to now. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database.

Results: We present a classification model that, for a specific gene of interest, predicts its mapping to a KEGG pathway based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to several pathways at the same time. The classification method produces a score for every possible mapping position of the gene in the KEGG hierarchy. Evaluations of our model, which combines an SVM with a ranking perceptron approach, show high prediction performance. Moreover, for signaling pathways we show that it is even possible to accurately predict membership in individual pathway components.
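The two ingredients named above, a domain signature per gene and a pathway hierarchy in which one gene may carry several labels, can be illustrated with a minimal Python sketch. This is not the gene2pathway implementation: the binary encoding, the toy hierarchy, the placeholder domain identifiers (`IPR_A` etc.) and the function names `encode_signature` and `with_ancestors` are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy slice of a pathway hierarchy: node -> parent (None = top level).
# The names only mimic the style of KEGG categories; they are illustrative.
PARENT: Dict[str, Optional[str]] = {
    "Metabolism": None,
    "Carbohydrate Metabolism": "Metabolism",
    "Glycolysis": "Carbohydrate Metabolism",
    "Environmental Information Processing": None,
    "Signal Transduction": "Environmental Information Processing",
    "MAPK signaling pathway": "Signal Transduction",
}


def encode_signature(domains: Set[str], vocabulary: List[str]) -> List[int]:
    """Assumed encoding: 1 if the gene contains the domain, else 0."""
    return [1 if d in domains else 0 for d in vocabulary]


def with_ancestors(labels: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Close a label set upward so that every pathway implies its ancestors."""
    closed: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        # Stop early once we reach a node whose ancestors are already included.
        while node is not None and node not in closed:
            closed.add(node)
            node = parent[node]
    return closed


# Placeholder domain vocabulary (not real InterPro accessions).
vocabulary = ["IPR_A", "IPR_B", "IPR_C"]
features = encode_signature({"IPR_A", "IPR_C"}, vocabulary)

# A gene mapped to two pathways at once gets both branches of the hierarchy.
targets = with_ancestors({"Glycolysis", "MAPK signaling pathway"}, PARENT)
```

Closing the label set upward is one simple way to make multi-label targets consistent with a hierarchy; how the published model scores positions in the hierarchy is described in the article itself.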

Availability: The R package gene2pathway is a supplement to this article.

Figure 2: Prediction performance of the hierarchical classification model on an external validation set for the pruned KEGG hierarchy (A, 2760 genes) and for signaling pathway components (B, 458 genes).

Mentions: We applied our method to predict KEGG pathway membership for a microarray dataset produced in our department: human MCF-7 breast cancer cells were treated with 100 nM tamoxifen for 48 h. Effects at the mRNA level were measured with in-house developed two-color cDNA microarrays with 26 722 functioning probes (Barth et al., 2006). After variance stabilization normalization (VSN) (Huber et al., 2002), 2937 differentially expressed genes were found with limma (Smyth, 2004) using a Benjamini–Hochberg FDR cutoff of 5% (Benjamini and Hochberg, 1995). Further details on the experiment can be obtained from the authors upon request. The 26 722 probes correspond to 12 692 genes with an Entrez Gene ID; InterPro annotation was available for 10 057 of these and KEGG annotation for 2760. Comparing our predicted KEGG pathway annotations with the original ones for the 2760 common genes gave a median accuracy of ∼100%, a median F1-value of ∼80%, and precision and recall in the same range (Fig. 2A). There were a few outliers, as indicated in the boxplot. These genes are mostly linked to the KEGG category ‘Human Diseases’, which we did not include in our model.
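One way to read the gap between an accuracy near 100% and an F1-value near 80% is sketched below in Python. It assumes the measures are computed per gene over the full set of candidate pathways and then summarized by the median; the excerpt does not spell this out, so that assumption, the function names `gene_metrics` and `median_metric`, and the placeholder pathway labels are all hypothetical. Under that reading, accuracy is dominated by the many pathways a gene correctly does not belong to, whereas F1 only looks at the few positive assignments.

```python
from statistics import median
from typing import Dict, List, Set


def gene_metrics(predicted: Set[str], actual: Set[str],
                 all_pathways: Set[str]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for one gene's pathway set.

    Assumes `predicted` and `actual` are subsets of `all_pathways`.
    """
    tp = len(predicted & actual)          # correctly predicted pathways
    fp = len(predicted - actual)          # predicted but not annotated
    fn = len(actual - predicted)          # annotated but missed
    tn = len(all_pathways) - tp - fp - fn # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(all_pathways),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def median_metric(per_gene: List[Dict[str, float]], name: str) -> float:
    """Median of one metric across genes, as a boxplot summary would show."""
    return median(m[name] for m in per_gene)


# Placeholder label space of 50 pathways (illustrative, not KEGG identifiers).
all_pathways = {f"P{i}" for i in range(50)}

# One gene: two of three predictions are right, one annotated pathway is missed.
m = gene_metrics({"P0", "P1", "P2"}, {"P0", "P1", "P3"}, all_pathways)
# accuracy = 48/50, while precision = recall = F1 = 2/3
```

In this toy case a single wrong and a single missed pathway leave accuracy at 0.96 but pull F1 down to about 0.67, which shows why the two measures can differ widely when each gene belongs to only a few of many pathways.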

