A probabilistic generative model for GO enrichment analysis.

Lu Y, Rosenfeld R, Simon I, Nau GJ, Bar-Joseph Z - Nucleic Acids Res. (2008)

Bottom Line: Overlap among GO categories makes it hard to determine whether the identified significant categories represent different functional outcomes or rather a redundant view of the same biological processes. Our model accommodates noise and errors in the selected gene set and GO. When used with microarray expression data and ChIP-chip data from yeast and human, our method correctly identified both general and specific enriched categories that were overlooked by other methods.

Affiliation: Computer Science Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213 USA.

ABSTRACT
The Gene Ontology (GO) is extensively used to analyze all types of high-throughput experiments. However, researchers still face several challenges when using GO and other functional annotation databases. One problem is the large number of hypotheses tested in each study. In addition, categories often overlap with both direct parents/descendants and other distant categories in the hierarchical structure. This makes it hard to determine whether the identified significant categories represent different functional outcomes or rather a redundant view of the same biological processes. To overcome these problems, we developed a generative probabilistic model that identifies a (small) subset of categories that, together, explain the selected gene set. Our model accommodates noise and errors in the selected gene set and GO. Using controlled GO data, our method correctly recovered most of the selected categories, leading to dramatic improvements over current methods for GO analysis. When used with microarray expression data and ChIP-chip data from yeast and human, our method correctly identified both general and specific enriched categories that were overlooked by other methods.

Figure 1: Construction of an activation graph. (a) A diagram showing a GO hierarchy of four categories and the five genes annotated by these categories (letters in each rectangle). Because of the ‘true path’ rule, each gene annotated by a category in the GO hierarchy is also annotated by all its parent categories. (b) The activation graph corresponding to this GO hierarchy when observing three of the genes (A,B,C). In this graph, we connect a gene node with a GO node if and only if the gene is annotated by that GO category. For this set of genes the active category is determined to be the orange category. Note that due to noise there is a gene that is selected even though it does not belong to the active category (A). Noise is also responsible for the fact that a gene belonging to the active category is not selected (D).
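
The 'true path' rule and the two kinds of noise described in the caption can be illustrated with a minimal Python sketch. The hierarchy layout, the category names (`root`, `left`, `orange`, `orange_child`) and the direct gene annotations below are assumptions chosen to be consistent with the caption; the actual layout of Figure 1 cannot be recovered from the text, and this is not the authors' implementation.

```python
# Hypothetical four-category hierarchy (child -> parent). The real layout of
# Figure 1 is not given in the caption, so this structure is an assumption.
PARENT = {"left": "root", "orange": "root", "orange_child": "orange"}

# Hypothetical direct annotations for the five genes A-E.
DIRECT = {
    "root": set(),
    "left": {"A", "E"},
    "orange": {"B"},
    "orange_child": {"C", "D"},
}


def apply_true_path_rule(direct, parent):
    """Copy each category's genes to all of its ancestor categories."""
    full = {cat: set(genes) for cat, genes in direct.items()}
    for cat, genes in direct.items():
        ancestor = parent.get(cat)
        while ancestor is not None:
            full[ancestor] |= genes
            ancestor = parent.get(ancestor)
    return full


annotations = apply_true_path_rule(DIRECT, PARENT)

selected = {"A", "B", "C"}  # genes observed in the experiment
active = "orange"           # category assumed to be active

# Selected genes that the active category does not explain (noise).
unexplained = selected - annotations[active]
# Genes of the active category that were not selected (also noise).
missed = annotations[active] - selected

print(sorted(unexplained), sorted(missed))
```

Under these assumed annotations, gene A is selected without belonging to the active category and gene D belongs to it without being selected, which mirrors the two noise cases in the caption.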

Mentions: To explain our method, one can think of this problem in terms of a bipartite graph representing the relationships between GO categories and genes (Figure 1). Nodes on the left side of the graph represent GO categories and nodes on the right represent all genes annotated in that species. We connect a gene node with a GO node by an edge if and only if the gene is annotated to belong to that GO category. We denote genes that were identified in the experiment as ‘ON’ or active and genes that were not identified as ‘OFF’ or inactive. Similarly, when a biological process (corresponding to a specific GO category) is active, we represent it by setting its GO node to ‘ON’ and when it is inactive, we set its state to ‘OFF’.
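
One way to picture the bipartite representation described above is the following sketch. The class name `ActivationGraph`, the helper `build_graph`, and the example annotations are hypothetical and serve only as an illustration; the paper's own data structures are not shown in this text.

```python
from dataclasses import dataclass, field


@dataclass
class ActivationGraph:
    """Bipartite graph: GO category nodes on one side, gene nodes on the other."""
    annotations: dict                                 # category -> set of annotated genes
    gene_on: dict = field(default_factory=dict)       # gene -> True (ON) / False (OFF)
    category_on: dict = field(default_factory=dict)   # category -> True / False

    def edges(self):
        # An edge exists if and only if the gene is annotated to the category.
        return {(cat, gene)
                for cat, genes in self.annotations.items()
                for gene in genes}


def build_graph(annotations, selected_genes, active_categories):
    """Attach ON/OFF states to every gene node and every category node."""
    all_genes = set().union(*annotations.values())
    return ActivationGraph(
        annotations=annotations,
        gene_on={g: g in selected_genes for g in all_genes},
        category_on={c: c in active_categories for c in annotations},
    )


# Assumed example annotations (already closed under the 'true path' rule).
example = {
    "root": {"A", "B", "C", "D", "E"},
    "left": {"A", "E"},
    "orange": {"B", "C", "D"},
    "orange_child": {"C", "D"},
}
graph = build_graph(example, selected_genes={"A", "B", "C"},
                    active_categories={"orange"})
print(len(graph.edges()), graph.gene_on["D"], graph.category_on["orange"])
```

Keeping the edge set separate from the ON/OFF states makes it straightforward to evaluate different candidate sets of active categories against the same fixed annotation structure.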

