Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations.

Yaari G, Bolen CR, Thakar J, Kleinstein SH - Nucleic Acids Res. (2013)

Bottom Line: Existing tests are affected by inter-gene correlations, resulting in a high Type I error. From this probability density function, P-values and confidence intervals can be extracted and post hoc analysis can be carried out while maintaining statistical traceability. QuSAGE is available as an R package, which includes the core functions for the method as well as functions to plot and visualize the results.


Affiliation: Department of Pathology, Yale University School of Medicine, New Haven, CT 06511, USA, Bioengineering program, Faculty of engineering, Bar Ilan University, 5290002, Ramat Gan, Israel and Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA.

ABSTRACT
Enrichment analysis of gene sets is a popular approach that provides a functional interpretation of genome-wide expression data. Existing tests are affected by inter-gene correlations, resulting in a high Type I error. The most widely used test, Gene Set Enrichment Analysis, relies on computationally intensive permutations of sample labels to generate a distribution that preserves gene-gene correlations. A more recent approach, CAMERA, attempts to correct for these correlations by estimating a variance inflation factor directly from the data. Although these methods generate P-values for detecting gene set activity, they are unable to produce confidence intervals or allow for post hoc comparisons. We have developed a new computational framework for Quantitative Set Analysis of Gene Expression (QuSAGE). QuSAGE accounts for inter-gene correlations, improves the estimation of the variance inflation factor and, rather than evaluating the deviation from a hypothesis with a P-value, it quantifies gene-set activity with a complete probability density function. From this probability density function, P-values and confidence intervals can be extracted and post hoc analysis can be carried out while maintaining statistical traceability. Compared with Gene Set Enrichment Analysis and CAMERA, QuSAGE exhibits better sensitivity and specificity on real data profiling the response to interferon therapy (in chronic Hepatitis C virus patients) and Influenza A virus infection. QuSAGE is available as an R package, which includes the core functions for the method as well as functions to plot and visualize the results.


Figure 2. The impact of unequal gene expression variance across groups. (A) The standard deviation of individual gene expression values (points) was calculated using data from pre-therapy PBMC samples in a study of chronic HCV infection (18). Samples were divided into two groups depending on the clinical response to therapy, and separate standard deviations were calculated for each group. Equality is indicated by the dashed line. (B) ROC curves based on stochastic simulations (see text) for testing the difference between two groups using Welch’s approximation (black line) or the pooled variance approach (red line). The parameters for the stochastic simulations were based on the empirical data [indicated by a white x in (A)]. The X and 0 indicate the values for which .

Mentions: The pooled variance approach can slightly improve sensitivity (if the equal variance assumption holds) and is easily compatible with linear models and analysis of variance (ANOVA). However, this approach can be extremely biased when the assumption of equal variances is violated, which we find is often the case in real gene expression data sets. To illustrate this, we turn to a series of clinical studies on the response of chronic HCV patients to IFN therapy, which we use as running examples throughout this article. Figure 2A plots the estimated standard deviation of 12 718 genes that were measured in one of these studies before the initiation of IFN therapy (18). In this case, the patients were classified by their clinical response to therapy, and it is clear that the standard deviation for most genes is higher in strong responders. To demonstrate the impact of these unequal variances, the sensitivity and specificity of both approaches were estimated using stochastic simulations based on the actual sample sizes and standard deviations from an example gene (the X in Figure 2A). Specificity (1 - false-positive rate) was calculated by sampling two groups from normal distributions with the same mean, whereas sensitivity (true-positive rate) was calculated by sampling from similar distributions with different means. The results for each approach (pooled variance and Welch) are summarized as receiver operating characteristic (ROC) curves (Figure 2B). Although the two ROC curves lie on top of each other, the Type I error rate for the pooled approach is biased, leading to a false-positive rate significantly higher than the nominal α level. Thus, we recommend the use of the Welch formalism for most cases, as there is little benefit, and significant potential disadvantage, to assuming that the variance of the ‘treatment’ measurements will be similar to that of the ‘control’.
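The kind of stochastic simulation described above can be sketched in a few lines. The following Python example is only an illustration, not the authors' code: the sample sizes (5 and 20), standard deviations (2.0 and 0.5), effect size and function names are hypothetical choices, since the values actually used for the example gene are not given in this text. It draws two normal samples repeatedly, applies the pooled-variance and Welch t-tests, and reports how often each rejects at α = 0.05 when the means are equal (false-positive rate) and when they differ (true-positive rate).

```python
import math
import random
import statistics


def t_two_sided_p(t, df, steps=200):
    """Two-sided P-value of Student's t, by Simpson integration of the density."""
    t = abs(t)
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3          # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)


def pooled_p(x, y):
    """Classical two-sample t-test assuming equal variances."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_two_sided_p(t, n1 + n2 - 2)


def welch_p(x, y):
    """Welch's t-test with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a = statistics.variance(x) / n1
    b = statistics.variance(y) / n2
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_two_sided_p(t, df)


def rejection_rate(test, n1, n2, sd1, sd2, delta, alpha=0.05, n_sim=2000, seed=0):
    """Fraction of simulated data sets in which `test` rejects at level alpha.

    delta = 0 gives the false-positive rate; delta != 0 gives the true-positive rate.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = [rng.gauss(delta, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        hits += test(x, y) < alpha
    return hits / n_sim


if __name__ == "__main__":
    # Hypothetical parameters: the smaller group has the larger spread.
    n1, n2, sd1, sd2 = 5, 20, 2.0, 0.5
    for name, test in (("pooled", pooled_p), ("Welch", welch_p)):
        fpr = rejection_rate(test, n1, n2, sd1, sd2, delta=0.0)
        tpr = rejection_rate(test, n1, n2, sd1, sd2, delta=2.0)
        print(f"{name:>6}: false-positive rate = {fpr:.3f}, true-positive rate = {tpr:.3f}")
```

Under these assumed parameters the pooled test should reject far more often than 5% of the time when there is no true difference, whereas the Welch test should stay close to the nominal level. Sweeping `alpha` over a grid and plotting true-positive against false-positive rate would give ROC curves of the kind shown in Figure 2B.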

