Limits...
ProbCD: enrichment analysis accounting for categorization uncertainty.

VĂȘncio RZ, Shmulevich I - BMC Bioinformatics (2007)

Bottom Line: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table.The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities.In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.

View Article: PubMed Central - HTML - PubMed

Affiliation: Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA. rvencio@gmail.com

ABSTRACT

Background: As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the Fisher Exact Test.

Results: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/.

Conclusion: We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.

Show MeSH
Fraction of "cell-cycle" GO terms selected as a function of the p-value. The curves show the fraction of GO terms containing the word "cell-cycle" in their definition that are considered significant as a function of the significance cutoff (p). The red curve is obtained with ProbCD and all others are obtained with one of the probability cutoffs: 50%, 70%, 95%, 99% or 99.99%.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2169266&req=5

Figure 3: Fraction of "cell-cycle" GO terms selected as a function of the p-value. The curves show the fraction of GO terms containing the word "cell-cycle" in their definition that are considered significant as a function of the significance cutoff (p). The red curve is obtained with ProbCD and all others are obtained with one of the probability cutoffs: 50%, 70%, 95%, 99% or 99.99%.

Mentions: Figure 3 shows that ProbCD considers more terms (vertical axis in Figure 3) containing the word "cell cycle", likely associated to periodically expressed genes, as significant if compared to the usual enrichment analysis in a wide range of significance values (p in Figure 3). Although this is not a proof, since one cannot be certain about which "cell cycle"-marked terms should be enriched, this is a reasonable indication that one can, in fact, avoid the discretization step when building the enrichment problem using ProbCD and obtain meaningful results.


ProbCD: enrichment analysis accounting for categorization uncertainty.

VĂȘncio RZ, Shmulevich I - BMC Bioinformatics (2007)

Fraction of "cell-cycle" GO terms selected as a function of the p-value. The curves show the fraction of GO terms containing the word "cell-cycle" in their definition that are considered significant as a function of the significance cutoff (p). The red curve is obtained with ProbCD and all others are obtained with one of the probability cutoffs: 50%, 70%, 95%, 99% or 99.99%.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2169266&req=5

Figure 3: Fraction of "cell-cycle" GO terms selected as a function of the p-value. The curves show the fraction of GO terms containing the word "cell-cycle" in their definition that are considered significant as a function of the significance cutoff (p). The red curve is obtained with ProbCD and all others are obtained with one of the probability cutoffs: 50%, 70%, 95%, 99% or 99.99%.
Mentions: Figure 3 shows that ProbCD considers more terms (vertical axis in Figure 3) containing the word "cell cycle", likely associated to periodically expressed genes, as significant if compared to the usual enrichment analysis in a wide range of significance values (p in Figure 3). Although this is not a proof, since one cannot be certain about which "cell cycle"-marked terms should be enriched, this is a reasonable indication that one can, in fact, avoid the discretization step when building the enrichment problem using ProbCD and obtain meaningful results.

Bottom Line: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table.The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities.In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.

View Article: PubMed Central - HTML - PubMed

Affiliation: Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA. rvencio@gmail.com

ABSTRACT

Background: As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the Fisher Exact Test.

Results: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/.

Conclusion: We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.

Show MeSH