ProbCD: enrichment analysis accounting for categorization uncertainty.

Vêncio RZ, Shmulevich I - BMC Bioinformatics (2007)

Bottom Line: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.

Affiliation: Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA. rvencio@gmail.com

ABSTRACT

Background: As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the Fisher Exact Test.

Results: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/.
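To make the idea of an expected contingency table concrete, here is a minimal Python sketch. It is not ProbCD's code (ProbCD is written in R), and the function name `expected_contingency_table` and its arguments are hypothetical. It assumes that each gene carries two independent membership probabilities, one for the set of interest and one for the annotation term, and that each cell of the 2x2 table is the sum over genes of the probability of falling into that cell.

```python
from typing import List, Sequence


def expected_contingency_table(
    p_selected: Sequence[float], p_annotated: Sequence[float]
) -> List[List[float]]:
    """Expected 2x2 table when both categorizations are uncertain.

    p_selected[i]  : probability that gene i belongs to the set of interest
    p_annotated[i] : probability that gene i is annotated to the term
    Assumption (not taken from the source): the two memberships of a gene
    are independent, so each cell probability is a simple product.
    """
    if len(p_selected) != len(p_annotated):
        raise ValueError("both probability vectors must have one entry per gene")

    n11 = n12 = n21 = n22 = 0.0
    for p, q in zip(p_selected, p_annotated):
        if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
            raise ValueError("probabilities must lie in [0, 1]")
        n11 += p * q                  # selected and annotated
        n12 += p * (1.0 - q)          # selected, not annotated
        n21 += (1.0 - p) * q          # not selected, annotated
        n22 += (1.0 - p) * (1.0 - q)  # neither
    return [[n11, n12], [n21, n22]]


# Made-up probabilities for two genes, for illustration only.
table = expected_contingency_table([0.9, 0.2], [0.5, 1.0])
print(table)
```

Each gene contributes a total mass of one spread over the four cells, so the cells always sum to the number of genes. When every probability is exactly 0 or 1, the result reduces to the ordinary static contingency table.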

Conclusion: We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.


Figure 2: Venn diagram of over-represented terms. The Venn diagram shows the number of GO terms considered significantly over-represented (p-value ≤ 0.01) by the Fisher Exact Test using four different probability cutoffs ℙ(gene_i is periodic) ≥ A, B, C or D ⇒ periodic: A = 0.70, B = 0.95, C = 0.99 and D = 0.9999.

Mentions: The above analysis process is repeated for all GO terms, with the results available as Additional Files and summarized in Figure 2. This figure suggests that there is a large variability in the possible final outcome of an enrichment analysis depending on the probability cutoff used to build the associated contingency table. This variability is avoided by ProbCD because it directly takes into account the uncertainty in the data instead of introducing a discretization step (Figure 1).
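The cutoff dependence described above can be illustrated with a small Python sketch. The gene probabilities and annotations below are invented for illustration and are not the data behind Figure 2; the helper names `table_at_cutoff` and `fisher_one_sided` are hypothetical. The sketch discretizes the probabilities at each of the four cutoffs from the caption, builds the resulting static 2x2 table, and computes a one-sided Fisher Exact Test p-value from the hypergeometric distribution.

```python
from math import comb
from typing import Sequence, Tuple


def table_at_cutoff(
    p_periodic: Sequence[float], annotated: Sequence[bool], cutoff: float
) -> Tuple[int, int, int, int]:
    """Discretize at `cutoff` and count the four cells (a, b, c, d)."""
    a = b = c = d = 0
    for p, is_annotated in zip(p_periodic, annotated):
        selected = p >= cutoff  # the discretization step
        if selected and is_annotated:
            a += 1
        elif selected:
            b += 1
        elif is_annotated:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) for a hypergeometric X with the table's margins fixed."""
    n_selected, n_annotated, n = a + b, a + c, a + b + c + d
    upper = min(n_selected, n_annotated)
    favorable = sum(
        comb(n_annotated, k) * comb(n - n_annotated, n_selected - k)
        for k in range(a, upper + 1)
    )
    return favorable / comb(n, n_selected)


# Toy data (assumed, not from the paper): eight genes, one GO term.
p_periodic = [0.9999, 0.995, 0.97, 0.80, 0.75, 0.60, 0.30, 0.10]
annotated = [True, True, False, True, False, False, True, False]

for cutoff in (0.70, 0.95, 0.99, 0.9999):  # A, B, C, D from Figure 2
    cells = table_at_cutoff(p_periodic, annotated, cutoff)
    print(cutoff, cells, fisher_one_sided(*cells))
```

Even with eight genes, each cutoff produces a different table and therefore a different p-value, which is the discretization effect that an expectation-based table is meant to avoid.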

