Capturing heterogeneity in gene expression studies by surrogate variable analysis.

Leek JT, Storey JD - PLoS Genet. (2007)

Bottom Line: It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.


Affiliation: Department of Biostatistics, University of Washington, Seattle, Washington, USA.

ABSTRACT
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce "surrogate variable analysis" (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
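The idea that unmodeled signal can be recovered from the data can be illustrated with a small simulation. The sketch below is not the published SVA procedure, which this abstract does not spell out. It is a simplified stand-in that regresses each gene on the modeled variable and takes the leading singular vector of the residuals as an estimate of the hidden factor. The function name `estimate_hidden_factor`, the simulated effect sizes, and the balanced "batch" factor are all assumptions made for illustration.

```python
import numpy as np


def estimate_hidden_factor(expr, design):
    """Leading singular vector (one value per sample) of the residuals of expr ~ design.

    expr   : genes x samples expression matrix
    design : samples x covariates matrix of modeled variables (with intercept)
    """
    # Least-squares fit of every gene on the modeled variables.
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    residuals = expr.T - design @ coef  # samples x genes
    # The first left singular vector spans the dominant residual pattern
    # across samples.
    u, _, _ = np.linalg.svd(residuals, full_matrices=False)
    return u[:, 0]


rng = np.random.default_rng(0)
n_genes, n_samples = 500, 20

group = np.repeat([0.0, 1.0], n_samples // 2)   # modeled variable of interest
hidden = np.tile([0.0, 1.0], n_samples // 2)    # unmodeled factor (hypothetical batch)

group_effect = np.zeros(n_genes)
group_effect[:50] = 2.0                          # genes truly affected by group
hidden_effect = np.zeros(n_genes)
hidden_effect[100:300] = rng.normal(0, 3, 200)   # genes affected by the hidden factor

expr = (
    np.outer(group_effect, group)
    + np.outer(hidden_effect, hidden)
    + rng.normal(0, 1, (n_genes, n_samples))
)

design = np.column_stack([np.ones(n_samples), group])
sv = estimate_hidden_factor(expr, design)

# The sign of a singular vector is arbitrary, so compare absolute correlation.
corr = abs(np.corrcoef(sv, hidden)[0, 1])
print(f"|correlation| between estimate and hidden factor: {corr:.3f}")
```

In this toy setting the estimated vector could be appended as an extra column of `design` before testing each gene for a group effect, which is the sense in which such an estimate is used "in conjunction with standard analysis techniques".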


Figure 1. Impact of Expression Heterogeneity. One thousand gene expression datasets containing EH were simulated, tested, and ranked for differential expression as detailed in Simulated Examples.
(A) A boxplot of the standard deviation of the ranks of each gene for differential expression over repeated simulated studies. Results are shown for analyses that ignore expression heterogeneity (Unadjusted), take expression heterogeneity into account by SVA (Adjusted), and for simulated data unaffected by expression heterogeneity (Ideal).
(B) For each simulated dataset, a Kolmogorov-Smirnov test was employed to assess whether the p-values of genes followed the correct Uniform distribution (Text S1). A quantile–quantile plot of the 1,000 Kolmogorov-Smirnov p-values is shown for the SVA-adjusted analysis (solid line) and the unadjusted analysis (dashed line). The SVA-adjusted analysis provides correctly distributed p-values, whereas the unadjusted analysis does not, because of EH.
(C) A plot of expected true positives versus FDR for the SVA-adjusted (solid) and unadjusted (dashed) analyses. The SVA-adjusted analysis shows increased power to detect true differential expression.

Mentions: Here, we introduce “surrogate variable analysis” (SVA) to identify, estimate, and utilize the components of EH. Figure 1 shows the effects of failing to account for unmodeled factors in a differential expression analysis, and the potential benefits of the SVA approach. EH drastically increases the variability of the ranking of genes for differential expression (Figure 1A), distorts the distribution of p-values, potentially causing highly conservative or anticonservative significance estimates (Figure 1B), and reduces the power to distinguish true associations between a measured variable of interest and gene expression (Figure 1C). However, employing SVA in these studies produces operating characteristics nearly equivalent to what one would obtain with no EH at all.
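The uniformity check behind Figure 1B can be sketched for a single dataset. The example below computes only the Kolmogorov-Smirnov distance between a set of p-values and the Uniform(0, 1) distribution. It does not reproduce the authors' simulation or the conversion of that distance to a p-value. The function name `ks_statistic_uniform` and the two toy samples are assumptions made for illustration.

```python
import random


def ks_statistic_uniform(p_values):
    """Kolmogorov-Smirnov distance between a sample and Uniform(0, 1)."""
    p = sorted(p_values)
    n = len(p)
    # Largest gap between the empirical CDF and the Uniform CDF F(x) = x,
    # checked just after and just before each sorted observation.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(p))


random.seed(0)

# Well-behaved p-values: draws from Uniform(0, 1).
uniform_p = [random.random() for _ in range(1000)]

# Anticonservative p-values: squaring pushes mass toward zero.
skewed_p = [u ** 2 for u in uniform_p]

d_uniform = ks_statistic_uniform(uniform_p)
d_skewed = ks_statistic_uniform(skewed_p)

print(f"KS distance, uniform p-values: {d_uniform:.3f}")
print(f"KS distance, skewed p-values:  {d_skewed:.3f}")
```

A small distance is consistent with correctly distributed p-values, while a large one signals the kind of distortion the paragraph above attributes to EH. In practice a library routine such as `scipy.stats.kstest` would also return the associated p-value.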

