ProbCD: enrichment analysis accounting for categorization uncertainty.

Vêncio RZ, Shmulevich I - BMC Bioinformatics (2007)

Bottom Line: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.


Affiliation: Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA. rvencio@gmail.com

ABSTRACT

Background: As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the Fisher Exact Test.

Results: We developed an open-source R-based software to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/.

Conclusion: We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.
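
The idea of building the contingency table from expectations rather than from hard counts can be illustrated with a minimal Python sketch. This is not ProbCD's R code or API. The function name `expected_table` is hypothetical, and the sketch assumes that, for each gene, membership in the selected list and membership in the annotation category are independent Bernoulli events with known probabilities.

```python
from typing import List, Sequence


def expected_table(p_selected: Sequence[float],
                   p_annotated: Sequence[float]) -> List[List[float]]:
    """Expected 2x2 contingency table from per-gene probabilities.

    p_selected[i]  : probability that gene i is in the list of interest
    p_annotated[i] : probability that gene i belongs to the category
    Assumption (ours, for illustration): the two events are independent
    for each gene, so each cell receives a product of probabilities.
    """
    if len(p_selected) != len(p_annotated):
        raise ValueError("both probability vectors must have one entry per gene")

    # Rows: selected / not selected. Columns: annotated / not annotated.
    table = [[0.0, 0.0], [0.0, 0.0]]
    for p, q in zip(p_selected, p_annotated):
        if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
            raise ValueError("probabilities must lie in [0, 1]")
        table[0][0] += p * q
        table[0][1] += p * (1.0 - q)
        table[1][0] += (1.0 - p) * q
        table[1][1] += (1.0 - p) * (1.0 - q)
    return table


if __name__ == "__main__":
    # Toy values, not taken from the article.
    print(expected_table([1.0, 0.0, 0.5], [1.0, 1.0, 0.5]))
```

When every probability is exactly 0 or 1, the sketch reduces to the ordinary count-based table, so the static contingency table is a special case of the expected one.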


Figure 1: Probability of being periodic. The blue curve represents the probability of a gene being periodic (Pr) according to the model of [22]. The genes are sorted by probability values (rank) on the horizontal axis to facilitate the visualization. The red curve is the deterministic approximation using a 70% probability cutoff to consider a gene as periodic: ℙ(gene_i is periodic) ≥ 0.70 ⇒ ℙ(gene_i is periodic) = 1 and ℙ(gene_i is periodic) < 0.70 ⇒ ℙ(gene_i is periodic) = 0. This approximation labels 15% of the genes as periodic.

Mentions: Suspecting that this non-intuitive result could be due to the probability threshold chosen to select periodic genes, illustrated in Figure 1, one could repeat the analysis above, building the contingency table with the cutoffs ℙ(gene_i is periodic) ≥ 50%, 95%, 99% or 99.99%. The result of this repeated analysis is also non-intuitive: the p-values are 0.12, 1.0, 1.0 and 1.0 for the 50%, 95%, 99% and 99.99% cutoffs, respectively, meaning that increasing the stringency used to define a gene as periodic only decreases the significance of the enrichment for GO:0007090.
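
The cutoff-based procedure described above can be sketched in a few lines of Python. This is an illustration only: the helper names `threshold_table` and `fisher_one_sided` are hypothetical, the gene probabilities and annotations are toy values rather than the article's data, and the one-sided (over-representation) form of the Fisher exact test is an assumption, since the text does not say which variant was used.

```python
from math import comb
from typing import Sequence, Tuple


def threshold_table(p_periodic: Sequence[float],
                    annotated: Sequence[bool],
                    cutoff: float) -> Tuple[int, int, int, int]:
    """Static 2x2 counts after calling a gene periodic when p >= cutoff.

    Returns (a, b, c, d):
      a = periodic and annotated,     b = periodic and not annotated,
      c = not periodic and annotated, d = neither.
    """
    a = b = c = d = 0
    for p, is_annotated in zip(p_periodic, annotated):
        periodic = p >= cutoff
        if periodic and is_annotated:
            a += 1
        elif periodic:
            b += 1
        elif is_annotated:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) under the hypergeometric null with fixed margins."""
    n = a + b + c + d
    row = a + b          # number of genes called periodic
    col = a + c          # number of genes annotated to the category
    upper = min(row, col)
    tail = sum(comb(col, k) * comb(n - col, row - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row)


if __name__ == "__main__":
    # Toy data (hypothetical): four genes with made-up probabilities.
    probs = [0.9, 0.8, 0.6, 0.2]
    annot = [True, False, True, False]
    for cutoff in (0.50, 0.70, 0.95):
        table = threshold_table(probs, annot, cutoff)
        print(cutoff, table, fisher_one_sided(*table))
```

In this sketch the table, and therefore the p-value, changes each time the cutoff moves, which is the sensitivity the passage describes. As a property of the one-sided test itself, the p-value is exactly 1 whenever no gene passing the cutoff carries the annotation.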

