Conceptualization of molecular findings by mining gene annotations.

Chen V, Lu X - BMC Proc (2013)

Bottom Line: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


ABSTRACT

Background: The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes that are differentially expressed in a tumor relative to normal tissue. It is therefore challenging to reveal, at a conceptual level, the major functional themes in which the genes are involved. There is a need for tools that reveal such themes by mining and representing semantic information in an objective and quantitative manner.

Methods: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations.

Results: We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph.
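
As a rough illustration of what an information-content-based information-loss metric can look like, the sketch below uses the widely used definition IC(t) = -log2 p(t), where p(t) is the fraction of annotated genes that carry term t or one of its descendants. This excerpt does not give the authors' exact formula, so the definition, the toy ontology, the gene names, and the function names are all assumptions made for illustration only.

```python
import math

# Hypothetical toy ontology (NOT real GO data): term -> list of parent terms.
PARENTS = {
    "BP": [],                       # stands in for the Biological Process root
    "signaling": ["BP"],
    "cell_death": ["BP"],
    "phospho_signaling": ["signaling"],
    "apoptosis": ["cell_death"],
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "g1": {"phospho_signaling"},
    "g2": {"phospho_signaling"},
    "g3": {"signaling"},
    "g4": {"apoptosis"},
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_probability(term):
    """Fraction of genes annotated to the term or to any of its descendants."""
    hits = sum(
        1
        for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return hits / len(ANNOTATIONS)


def information_content(term):
    """Assumed definition: IC(t) = -log2 p(t). The root has IC 0."""
    p = term_probability(term)
    if p == 0:
        raise ValueError(f"no gene is annotated under {term!r}")
    return -math.log2(p)


def information_loss(specific, general):
    """IC given up when an annotation is replaced by one of its ancestors."""
    if general not in ancestors(specific):
        raise ValueError(f"{general!r} is not an ancestor of {specific!r}")
    return information_content(specific) - information_content(general)


if __name__ == "__main__":
    for term in PARENTS:
        print(term, round(information_content(term), 3))
    print(round(information_loss("phospho_signaling", "signaling"), 3))
```

Under this assumed definition, generalizing an annotation all the way to the root discards its entire information content, which is why a summary term close to the root counts as uninformative.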

Conclusions: Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


Figure 4: An example of a KEGG pathway containing genes involved in multiple processes. The "MAPK signaling pathway" (hsa04010) is shown. Two functionally coherent subsets are highlighted. The genes summarized by GO:0023014 (Signal transduction by phosphorylation) are in green; the genes summarized by GO:0006915 (Apoptotic process) are in blue. Genes involved in both biological processes are in yellow.

Mentions: In the previous section, we used the gene sets from the KEGG pathway database as surrogates for coherent gene sets to compare the discriminative power of statistical models utilizing different combinations of information-loss metrics and graph-based statistics. However, we noted that many KEGG pathways contain a large number of genes performing diverse functions, and our model classified them as non-coherent gene sets. Instead of simply treating such calls by our model as errors, we further investigated whether it made sense to treat a large KEGG pathway gene set as a coherent gene set, and whether it would be more sensible to use our approach to identify fine-grained, coherent gene sets within such a pathway. Figure 4 shows an example of one such KEGG pathway (hsa04010): the human MAPK signaling pathway, which includes 262 unique genes (not all are shown in the figure). This KEGG pathway comprehensively includes many cellular signal transduction components that share the proteins involved in the MAPK cascade, including growth factor signaling pathways and the signaling pathway that induces apoptosis. As such, it may not be biologically sensible, or even possible, to find an informative concept that summarizes the diverse biological processes of the genes in this KEGG database entry. Indeed, when we searched for a GO term covering all the genes listed in this pathway, the search led to the root term of the Biological Process domain, which is the most uninformative term. Therefore, it is sensible that our model treated the whole set of genes as not coherently related. When we applied our method to the gene list of this pathway to search for coherent subsets, our model returned a number of non-disjoint gene subsets, reflecting different aspects of the functions in which these genes participate.
For example, one subset was summarized by the GO term GO:0023014 (signal transduction by phosphorylation); it included most of the genes in the pathway that are involved in protein phosphorylation, including MAPK kinases, and these genes are shown in green. Another facet of the functions of these genes was summarized by the GO term GO:0006915 (apoptotic process); it included many genes with well-known roles in apoptosis, and these genes are shown in blue. Thus, in this figure, these two concepts reflect two functional themes of the genes at a suitable level of specificity and generality; the remaining concepts, representing other functional themes, are listed in Additional File 1.
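
The behavior described above, where a functionally diverse gene list can only be covered by the root term while its subsets are covered by specific terms, can be reproduced on a toy example. The sketch below is not the authors' method (which relies on an information-loss metric and Steiner-tree statistics); it only looks up the deepest term shared by all genes in a set. The ontology, gene names, annotations, and function names are hypothetical.

```python
# Hypothetical toy ontology (NOT real GO data): term -> list of parent terms.
PARENTS = {
    "BP": [],                       # stands in for the Biological Process root
    "signaling": ["BP"],
    "cell_death": ["BP"],
    "phospho_signaling": ["signaling"],
    "apoptosis": ["cell_death"],
}

# Hypothetical annotations; geneB takes part in both processes.
GENE_TERMS = {
    "geneA": {"phospho_signaling"},
    "geneB": {"phospho_signaling", "apoptosis"},
    "geneC": {"apoptosis"},
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def depth(term):
    """Length of the longest parent path from the term up to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def implied_terms(gene):
    """All terms a gene carries, directly or through ancestor terms."""
    terms = set()
    for term in GENE_TERMS[gene]:
        terms |= ancestors(term)
    return terms


def most_specific_cover(genes):
    """Deepest term that covers every gene in the list (ties broken by name)."""
    shared = set.intersection(*(implied_terms(g) for g in genes))
    return max(shared, key=lambda t: (depth(t), t))


def genes_under(term):
    """Genes annotated to the term or to one of its descendants."""
    return {g for g in GENE_TERMS if term in implied_terms(g)}


if __name__ == "__main__":
    print(most_specific_cover(["geneA", "geneB", "geneC"]))  # whole list
    print(most_specific_cover(["geneA", "geneB"]))           # one subset
    print(genes_under("phospho_signaling") & genes_under("apoptosis"))
```

In this toy setting the full list is covered only by the root, whereas the two subsets are covered by specific terms and overlap in geneB, which plays the role of the genes shown in yellow in Figure 4.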

