Conceptualization of molecular findings by mining gene annotations.

Chen V, Lu X - BMC Proc (2013)

Bottom Line: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


ABSTRACT

Background: The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes that are differentially expressed in tumor tissue relative to normal tissue. It is therefore challenging to reveal, at a conceptual level, the major functional themes in which the genes are involved. There is a need for tools capable of revealing such themes by mining and representing semantic information in an objective and quantitative manner.

Methods: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations.

Results: We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph.
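The idea of an information-content-based loss can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes the common definition IC(t) = -log p(t), where p(t) is the fraction of genes annotated to term t or its descendants, and takes the loss of merging a term into an ancestor to be the drop in IC. The function names, term labels, and counts are hypothetical and serve only as illustration.

```python
from math import log

def information_content(term, gene_counts, total_genes):
    """IC(t) = -log p(t); p(t) is the fraction of genes annotated to t.

    Assumption: gene_counts[term] already includes genes annotated to
    descendants of the term, so counts grow toward the root.
    """
    return -log(gene_counts[term] / total_genes)

def information_loss(term, ancestor, gene_counts, total_genes):
    """Loss incurred by replacing a specific term with a more general ancestor.

    Assumption: loss is the difference in IC; the paper may define it differently.
    """
    return (information_content(term, gene_counts, total_genes)
            - information_content(ancestor, gene_counts, total_genes))

# Hypothetical counts: a specific term, a general term, and the root.
counts = {"specific_term": 10, "general_term": 100, "root": 1000}
loss = information_loss("specific_term", "general_term", counts, 1000)
```

Under this definition a term that annotates every gene carries no information, and merging into a more general ancestor always gives a non-negative loss.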

Conclusions: Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


Figure 5: Distribution of genes associated with summarizing GO terms. A. Boxplots of the distribution of the number of genes associated with each term under five different conditions: original GO annotations; enriched GO annotations; our method with p-value thresholds of ≤ 0.01 and ≤ 0.05; and the Generic GO slim. B. Plot of the proportion of GO terms per level in the GO hierarchy under the five different conditions.

Mentions: As a concrete example, we identified a list of genes from the ovarian cancer samples collected from TCGA. The list included a total of 837 genes, each of which was differentially expressed in at least 5 tumor samples [31]. We first set out to evaluate whether the original GO annotations, GO annotation enrichment analysis, and GO slim mapping were suitable for identifying functional themes. We found that a total of 2,175 unique GO terms from the BP domain were associated with the genes in the list, and that the median number of genes annotated by these GO terms was 1. The distribution of the number of genes annotated by these terms is shown as a box plot, labeled "original", in Figure 5A. We then performed a conventional hypergeometric-distribution-based GO term enrichment analysis to identify the "enriched" GO terms at a p-value cutoff of 0.05. The analysis yielded a set of 433 unique GO terms, each annotating a median of 4 genes (see Figure 5A). Since the gene modules associated with the enriched GO terms appeared to be too small to represent the "major" themes, we further studied the genes mapped to the human GO slim terms. A total of 70 human GO slim terms were found, with a median of 75 genes mapped to each.
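The hypergeometric enrichment step described above can be sketched as follows. This is a generic illustration of the conventional one-sided test, not the authors' code: the function names, the toy annotation table, and the background size are all hypothetical, and no multiple-testing correction is applied because the text does not mention one.

```python
from math import comb
from statistics import median

def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the study list, k: study-list genes annotated to the term.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total

def enriched_terms(gene_list, annotations, background_size, cutoff=0.05):
    """Return {term: (p_value, overlap_size)} for terms with p <= cutoff."""
    genes = set(gene_list)
    hits = {}
    for term, annotated in annotations.items():
        overlap = genes & annotated
        if not overlap:
            continue
        p = enrichment_p_value(len(overlap), len(genes), len(annotated), background_size)
        if p <= cutoff:
            hits[term] = (p, len(overlap))
    return hits

# Hypothetical toy data: 20 background genes, a 4-gene study list.
study = ["g1", "g2", "g3", "g4"]
toy_annotations = {
    "TERM_A": {"g1", "g2", "g3", "g4"},                # all 4 annotated genes are in the list
    "TERM_B": {"g1"} | {f"g{i}" for i in range(5, 14)}, # 10 annotated genes, 1 in the list
}
hits = enriched_terms(study, toy_annotations, background_size=20)
median_genes_per_term = median(size for _, size in hits.values())
```

The last line mirrors the summary statistic reported in the text, the median number of list genes annotated by each enriched term.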

