Gene Vector Analysis (Geneva): a unified method to detect differentially-regulated gene sets and similar microarray experiments.

Tanner SW, Agarwal P - BMC Bioinformatics (2008)

Bottom Line: This new method works on both gene sets and on gene lists/vectors as input queries, and can effectively query databases consisting of sets of biologically related sets, or of results from other microarray experiments. We also provide a standard evaluation data set based on 5 pairs of related experiments that should share similar functional relationships and 28 pairs of unrelated experiments from GEO. Discovering relationships amongst GEO data sets has implications for drug repositioning, and understanding relationships between diseases and drugs.

Affiliation: Bioinformatics program, University of California, San Diego, La Jolla, CA 92093-0419, USA. stanner@ucsd.edu

ABSTRACT

Background: Microarray experiments measure changes in the expression of thousands of genes. The resulting lists of genes with changes in expression are then searched for biologically related sets using several divergent methods such as the Fisher Exact Test (as used in multiple GO enrichment tools), Parametric Analysis of Gene Expression (PAGE), Gene Set Enrichment Analysis (GSEA), and the connectivity map.

Results: We describe an analytical method (Geneva: Gene Vector Analysis) to relate genes to biological properties and to other similar experiments in a uniform way. This new method works on both gene sets and on gene lists/vectors as input queries, and can effectively query databases consisting of sets of biologically related sets, or of results from other microarray experiments. We also present an improvement to the model estimate by using the empirical background distribution drawn from previous experiments. We validated Geneva by rediscovering a number of previous findings, and by finding significant relationships within microarrays in the GEO repository.

Conclusion: Provided a reasonable corpus of previous experiments is available, this method is more accurate than the class label permutation model, especially for data sets with a limited number of replicates. Geneva is, moreover, computationally faster because the background distributions can be precomputed. We also provide a standard evaluation data set based on 5 pairs of related experiments that should share similar functional relationships and 28 pairs of unrelated experiments from GEO. Discovering relationships amongst GEO data sets has implications for drug repositioning and for understanding relationships between diseases and drugs.
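
The speed advantage of a precomputed background can be illustrated with a minimal sketch. It assumes that a query is scored by the Pearson correlation between a vector of per-gene scores and a 0/1 gene-set membership vector, and that the background is the sorted list of such scores over a corpus of previous experiments. The function names and the add-one correction are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pearson_score(query_vector, gene_set_mask):
    """Pearson correlation between per-gene scores and a 0/1 membership vector."""
    return float(np.corrcoef(query_vector, gene_set_mask)[0, 1])

def precompute_background(corpus, gene_set_mask):
    """Sorted scores of one gene set against every corpus experiment (one per row).

    Computed once per gene set and reused for every later query.
    """
    return np.sort([pearson_score(row, gene_set_mask) for row in corpus])

def empirical_p_value(observed, sorted_background):
    """One-sided p-value: share of background scores >= observed.

    The add-one correction (an assumption of this sketch) keeps the
    p-value strictly above zero.
    """
    n = len(sorted_background)
    # searchsorted(side="left") counts the background scores strictly below observed
    n_ge = n - np.searchsorted(sorted_background, observed, side="left")
    return (n_ge + 1) / (n + 1)
```

With the background sorted in advance, each new query costs one correlation and one binary search, whereas a class label permutation model has to recompute the statistic for every permutation.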

Figure 2: Precision (1-FDR) of gene set query methods with Pearson-based p-values calibrated on GEO and CMAP, compared to permutation-based p-values and no permutation. Also included for comparison are the GSEA q-value (based on FDR) and the GSEA p-value (based on FWER). We compared precision across the various queries at each gene set threshold (N), plotted on the x-axis (as described in Methods).

Mentions: As described in the Methods, we computed the false discovery rate for queries across pairs of related and unrelated experiments (Figure 2). If p-value calibration is performed, many more gene sets are observed at a 10% false discovery rate. The results also demonstrate that either large corpus provides a reasonable training set for p-value calibration, as queries calibrated with either GEO or CMAP perform significantly better than those without calibration. The GEO corpus has the advantage that it includes a wide array of treatments and tissue types, and that it uses t-scores (available only for the GEO corpus) rather than fold changes (available for the CMAP corpus). On the other hand, the CMAP corpus is somewhat larger, and has the advantage that it was generated by one lab with high reproducibility. The GEO corpus was arguably more effective, as measured by the slower decrease in precision. However, when we list the top 10 gene sets for these experiments (as measured by product of p-values), the lists reported using calibration against the CMAP corpus appeared to be most biologically reasonable (in our subjective opinion).
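
The two quantities used in this comparison, precision at a threshold N and the ranking of gene sets by product of p-values, can be sketched as follows. This is a simplified illustration, not the procedure from the paper's Methods: it assumes that a hit from a related pair of experiments counts as true and a hit from an unrelated pair counts as false, and all function and variable names are hypothetical.

```python
def precision_at_n(p_related, p_unrelated, n):
    """Precision (1 - FDR) among the n smallest p-values pooled from both groups.

    Assumption of this sketch: hits from related pairs are true positives,
    hits from unrelated pairs are false positives.
    """
    labelled = [(p, True) for p in p_related] + [(p, False) for p in p_unrelated]
    top = sorted(labelled, key=lambda item: item[0])[:n]
    if not top:
        return float("nan")
    return sum(is_true for _, is_true in top) / len(top)

def rank_by_p_value_product(p_first, p_second):
    """Order gene sets by the product of their p-values in two experiments.

    Both arguments map gene set name -> p-value; only shared names are ranked.
    """
    shared = p_first.keys() & p_second.keys()
    return sorted(shared, key=lambda name: p_first[name] * p_second[name])
```

Plotting `precision_at_n` against N for each calibration scheme gives a curve of the kind shown in Figure 2, and the first 10 entries of `rank_by_p_value_product` correspond to a top 10 list for one pair of experiments.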

