Conceptualization of molecular findings by mining gene annotations.

Chen V, Lu X - BMC Proc (2013)

Bottom Line: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


ABSTRACT

Background: The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes that are differentially expressed in a tumor relative to normal tissue. It is therefore challenging to reveal, at a conceptual level, the major functional themes in which these genes are involved. There is a need for tools capable of revealing such themes by mining and representing semantic information in an objective and quantitative manner.

Methods: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations.

Results: We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph.

Conclusions: Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.
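
The graph-based statistic mentioned in the Results can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy graph, the zero-weight gene-to-term edges, the node names, and the use of a simple shortest-path heuristic (terminals are attached one at a time along shortest paths) in place of an exact Steiner tree solver. The idea is that genes are added as nodes to the GO graph, and a short tree connecting a gene set suggests functional coherence.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances and predecessor links from source in a weighted graph."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def path_edges(prev, source, target):
    """Edges on the shortest path from source to target, as unordered pairs."""
    edges, node = set(), target
    while node != source:
        edges.add(frozenset((node, prev[node])))
        node = prev[node]
    return edges

def approx_steiner_length(adj, terminals):
    """Heuristic tree length: grow a tree over the terminals via shortest paths."""
    terminals = list(terminals)
    sp = {t: dijkstra(adj, t) for t in terminals}
    in_tree, edges = {terminals[0]}, set()
    while len(in_tree) < len(terminals):
        # Pick the closest terminal that is not yet connected.
        _, u, v = min((sp[u][0][v], u, v)
                      for u in in_tree for v in terminals if v not in in_tree)
        edges |= path_edges(sp[u][1], u, v)
        in_tree.add(v)
    total = 0.0
    for edge in edges:
        a, b = tuple(edge)
        total += adj[a][b]
    return total

def undirected(edge_list):
    adj = {}
    for a, b, w in edge_list:
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj

# Hypothetical augmented graph: GO terms plus gene nodes g1..g3.
adj = undirected([
    ("root", "A", 2.0), ("root", "B", 2.0),
    ("A", "a1", 0.5), ("A", "a2", 0.5), ("B", "b1", 0.5),
    ("g1", "a1", 0.0), ("g2", "a2", 0.0), ("g3", "b1", 0.0),  # assumed zero weight
])

coherent = approx_steiner_length(adj, ["g1", "g2"])   # genes under the same parent term
scattered = approx_steiner_length(adj, ["g1", "g3"])  # genes joined only through the root
```

In this toy graph the tree connecting g1 and g2 stays below term A, whereas connecting g1 and g3 requires passing through the root, so the second tree is much longer.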



Figure 2: Distribution of Edge weights. Both A and B are organized with the IB-based edge weight plot on the left and the IC-based edge weight plot on the right. A. Distribution of the shortest 90% of edges in the entire graph. B. Boxplots of the edge weight distribution organized according to the level of edge, where level 0 contains edges that connect to the root.

Mentions: We first set out to assess which of the two information-loss measures, the IC-based or the IB-based, better fit our goal of assessing information loss when genes annotated by highly specific GO terms were merged under a general GO term. We compared the distributions of edge weights under the two metrics to study their characteristics. Figure 2A shows histograms of edge lengths calculated using either IB-based (left panel) or IC-based information loss (right panel). The figure shows that the numeric scales of the two metrics differ by orders of magnitude. It also shows that both distributions are dominated by relatively short edges. We anticipated this finding, because there are more edges close to the leaf level of the GO hierarchy, where the differences in semantic context or protein information are expected to be small. The distribution of the IB-based edge weights exhibits a smoother transition, whereas the IC-based edge weights show a peak at edge length zero followed by a quick drop, and the distribution contains several spikes.
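
To make the IC-based edge weight concrete, the following is a minimal sketch under stated assumptions. It assumes the common definition of information content, IC(t) = -log p(t), where p(t) is the fraction of annotated genes that fall under term t after propagating annotations up the hierarchy, and it takes the edge weight to be IC(child) - IC(parent). The toy hierarchy, gene names, and function names are hypothetical, and the paper's exact formula may differ.

```python
import math

# Hypothetical toy hierarchy: parent term -> list of child terms.
children = {
    "root": ["P", "Q"],
    "P": ["C1", "C2"],
    "Q": ["C3"],
    "C1": [], "C2": [], "C3": [],
}

# Hypothetical direct annotations: term -> genes annotated to exactly that term.
direct = {"C1": {"g1", "g2"}, "C2": {"g3"}, "C3": {"g4"}}

def propagated_genes(term):
    """Genes annotated to the term or to any of its descendants."""
    genes = set(direct.get(term, set()))
    for child in children[term]:
        genes |= propagated_genes(child)
    return genes

total_genes = len(propagated_genes("root"))

def information_content(term):
    """IC(t) = -log p(t), with p(t) estimated from propagated annotations."""
    return -math.log(len(propagated_genes(term)) / total_genes)

def edge_weight(child, parent):
    """Assumed information loss when the child term is merged into its parent."""
    return information_content(child) - information_content(parent)

weights = {
    (child, parent): edge_weight(child, parent)
    for parent, kids in children.items()
    for child in kids
}
```

Under this definition, an edge whose parent covers exactly the same genes as its child (here C3 and Q) has weight zero, which is one plausible source of the peak at edge length zero in the IC-based distribution.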

