Conceptualization of molecular findings by mining gene annotations.

Chen V, Lu X - BMC Proc (2013)

Bottom Line: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.


ABSTRACT

Background: The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes that are differentially expressed in a tumor relative to normal tissue. It is therefore challenging to reveal, at a conceptual level, the major functional themes in which these genes are involved. There is a need for tools that reveal such themes by mining and representing semantic information in an objective and quantitative manner.

Methods: In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations.
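
The summarization step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy ontology, the gene annotations, the definition of information content (IC) as the negative log of a term's annotation frequency, and the definition of information loss as the IC difference between an original annotation and the summary term. It is not the authors' implementation.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms (a DAG).
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "glucose_metabolism": ["metabolism"],
}

# Hypothetical direct annotations: gene -> GO-like terms.
ANNOTATIONS = {
    "geneA": ["lipid_metabolism"],
    "geneB": ["glucose_metabolism"],
    "geneC": ["glucose_metabolism"],
    "geneD": ["signaling"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def covered_terms(gene):
    """All terms a gene is annotated to, directly or through an ancestor."""
    covered = set()
    for term in ANNOTATIONS[gene]:
        covered |= ancestors(term)
    return covered


def information_content():
    """Assumed definition: IC(t) = -log(fraction of genes annotated to t)."""
    counts = {term: 0 for term in PARENTS}
    for gene in ANNOTATIONS:
        for term in covered_terms(gene):
            counts[term] += 1
    total = len(ANNOTATIONS)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def summarize(genes, ic):
    """Pick the most informative term shared by every gene in the subset."""
    shared = None
    for gene in genes:
        terms = covered_terms(gene)
        shared = terms if shared is None else shared & terms
    return max(shared, key=lambda t: ic[t])


ic = information_content()
coherent = ["geneA", "geneB", "geneC"]
summary = summarize(coherent, ic)

# Assumed loss: IC given up when each original annotation is replaced
# by the summary term.
loss = sum(ic[ANNOTATIONS[g][0]] - ic[summary] for g in coherent)
```

In this toy setting the three metabolic genes are summarized by their shared parent term, whereas adding the signaling gene would force the summary up to the uninformative root.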

Results: We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph.
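
To make the graph-based statistic concrete, the sketch below computes the length of a tree connecting a set of gene nodes in a small graph. It rests on several assumptions: the toy graph simply attaches hypothetical gene nodes to hypothetical GO-like terms (how the authors augment the GO graph is not described in this excerpt), all edges have unit length, and the tree is built with a greedy shortest-path heuristic rather than an exact Steiner tree algorithm.

```python
from collections import deque

# Hypothetical toy graph: GO-like terms plus gene nodes attached to their terms.
EDGES = [
    ("root", "metabolism"),
    ("root", "signaling"),
    ("metabolism", "lipid_metabolism"),
    ("metabolism", "glucose_metabolism"),
    ("geneA", "lipid_metabolism"),
    ("geneB", "glucose_metabolism"),
    ("geneC", "glucose_metabolism"),
    ("geneD", "signaling"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def path_to_nearest(adj, sources, targets):
    """Multi-source BFS: shortest path from any source to the nearest target."""
    prev = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path  # runs from the target back to a source node
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    raise ValueError("terminals are not connected")


def steiner_tree_length(adj, terminals):
    """Greedy heuristic: repeatedly attach the terminal nearest to the tree.

    Returns the number of edges in the resulting tree (an approximation
    of the Steiner tree length in general graphs).
    """
    ordered = sorted(set(terminals))
    tree_nodes = {ordered[0]}
    remaining = set(ordered[1:])
    length = 0
    while remaining:
        path = path_to_nearest(adj, tree_nodes, remaining)
        length += len(path) - 1
        tree_nodes.update(path)
        remaining -= tree_nodes
    return length


adj = build_adjacency(EDGES)
coherent_length = steiner_tree_length(adj, ["geneA", "geneB", "geneC"])
scattered_length = steiner_tree_length(adj, ["geneA", "geneB", "geneD"])
```

Under these assumptions, genes annotated to closely related terms are connected by a shorter tree than genes spread across distant branches, which is the intuition behind using tree length as a coherence statistic.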

Conclusions: Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.



Figure 3: Example of distributions of graph-based statistics and the discriminative power of coherence models. A. In the scatter plot (top left), statistics derived from KEGG gene sets (red), the matching random gene sets (blue), and simulated random gene sets during model-building (green) were plotted. Top right is the corresponding ROC curve. B. Scatter plot of the graph-based statistics and ROC curve of the model using IC as information loss and Steiner tree length derived from augmented GO graph.

Mentions: Figure 3 shows the results of combining the IC-based information-loss measure with different graph-based statistical schemes. In Panel A (top two plots), IC served as the information-loss metric and the length of the Steiner tree derived from a GOGeneGraph served as the statistic. This combination cannot separate KEGG pathway gene sets from the matched random and simulated random gene sets, so the discriminative power of the model is poor, as the ROC curve shows. In contrast, when IC is combined with the length of the Steiner tree derived from an augmented GOGeneGraph (Panel B), the data points from the KEGG gene sets and the random gene sets show markedly different distributions, and the corresponding model has much higher discriminative power. Table 1 reports the ROC analysis of all pairwise combinations of information-loss metrics and graph-based statistical schemes; the combination of IC and the Steiner tree from an augmented GO graph performs best.
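
The discriminative power summarized by a ROC curve can be reduced to a single number, the area under the curve (AUC), which equals the probability that a randomly chosen positive set scores better than a randomly chosen negative set. The sketch below computes it by direct pairwise comparison. The statistics are hypothetical values, not data from the paper, and the function assumes that a lower statistic (for example, a shorter tree) indicates a more coherent gene set.

```python
def auc_lower_is_positive(positive_stats, negative_stats):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive set has the lower statistic; ties count as half a win."""
    wins = 0.0
    for p in positive_stats:
        for n in negative_stats:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_stats) * len(negative_stats))


# Hypothetical statistics for illustration only.
pathway_like = [5, 6, 6, 7]   # stand-in for pathway-derived gene sets
random_like = [7, 8, 9, 9]    # stand-in for matched random gene sets

good_auc = auc_lower_is_positive(pathway_like, random_like)
chance_auc = auc_lower_is_positive(random_like, random_like)
```

An AUC near 1 corresponds to well-separated distributions, while an AUC near 0.5 corresponds to a statistic that cannot distinguish the two groups.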

