Capturing heterogeneity in gene expression studies by surrogate variable analysis.

Leek JT, Storey JD - PLoS Genet. (2007)

Bottom Line: It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.


Affiliation: Department of Biostatistics, University of Washington, Seattle, Washington, USA.

ABSTRACT
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce "surrogate variable analysis" (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.


Figure 2. Example of Expression Heterogeneity
(A) A heatmap of a simulated microarray study consisting of 1,000 genes measured on 20 arrays.
(B) Genes 1–300 in this simulated study are differentially expressed between two hypothetical treatment groups; here the two groups are shown as an indicator variable for each array.
(C) Genes 201–500 in each simulated study are affected by an independent factor that causes EH. This factor is distinct from, but possibly correlated with, the group variable. Here, the factor is shown as a quantitative variable, but it could also be an indicator variable or some linear or nonlinear function of the covariates.
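
A layout like the one in this caption can be sketched in a few lines of NumPy. This is only an illustration, not the authors' simulation code: the function name `simulate_eh_study`, the standard normal effect sizes, the unit-variance noise, and the choice to draw the unmodeled factor independently of the group variable are all assumptions made here. Only the dimensions and the affected gene ranges come from the caption.

```python
import numpy as np

def simulate_eh_study(n_genes=1000, n_arrays=20, seed=0):
    """Simulate an expression matrix with a group effect and an unmodeled factor.

    Hypothetical helper: effect sizes and noise model are assumptions.
    """
    rng = np.random.default_rng(seed)

    # Primary variable: indicator separating the first half of the arrays
    # from the second half (first ten vs. last ten by default).
    group = np.repeat([0.0, 1.0], n_arrays // 2)

    # Unmodeled factor causing EH: here a quantitative variable drawn
    # independently of the group variable (assumption).
    factor = rng.normal(size=n_arrays)

    # Gene-specific coefficients, zero outside the affected gene ranges.
    b_group = np.zeros(n_genes)
    b_group[:300] = rng.normal(size=300)       # genes 1-300: group effect
    b_factor = np.zeros(n_genes)
    b_factor[200:500] = rng.normal(size=300)   # genes 201-500: EH effect

    noise = rng.normal(size=(n_genes, n_arrays))
    expr = np.outer(b_group, group) + np.outer(b_factor, factor) + noise
    return expr, group, factor

expr, group, factor = simulate_eh_study()
print(expr.shape)  # genes x arrays
```

Genes 201–300 carry both signals, which is the overlap where an unmodeled factor is most likely to be confused with the group effect.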

Mentions: We have developed an approach called surrogate variable analysis that appropriately borrows information across genes to estimate the large-scale effects of all unmodeled factors directly from the expression data. Figure 2A shows a simulated example of EH. The primary variable distinguishes the first ten arrays from the last ten (Figure 2B); however, the unmodeled factor may have a variety of effects on expression (Figure 2C). The SVA approach flexibly captures signatures of EH, including highly irregular patterns not following any simple model, by estimating the signatures of EH in the expression data themselves rather than attempting to estimate specific unmodeled factors such as age or gender. After the surrogate variables are constructed, they are then incorporated into any subsequent analysis as covariates in the usual way.

The SVA algorithm, described in mathematical detail in Materials and Methods, can conceptually be broken down into four basic steps:
(Step 1) Remove the signal due to the primary variable(s) of interest to obtain a residual expression matrix. Apply a decomposition to the residual expression matrix to identify signatures of EH in terms of an orthogonal basis of singular vectors that completely reproduces these signatures. Use a statistical test to determine the singular vectors that represent significantly more variation than would be expected by chance.
(Step 2) Identify the subset of genes driving each orthogonal signature of EH through a significance analysis of associations between the genes and the EH signatures on the residual expression matrix.
(Step 3) For each subset of genes, build a surrogate variable based on the full EH signature of that subset in the original expression data.
(Step 4) Include all significant surrogate variables as covariates in subsequent regression analyses, allowing for gene-specific coefficients for each surrogate variable.
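
The outline of Steps 1 and 4 can be illustrated with a deliberately simplified NumPy sketch. It is not the published SVA algorithm: the statistical test for choosing significant singular vectors is replaced by a caller-supplied count `n_sv`, and Steps 2 and 3 (gene subset selection and rebuilding each surrogate variable from the original data) are skipped, so the residual singular vectors are used directly as surrogate variables. All function names and the simulated data are hypothetical.

```python
import numpy as np

def residual_matrix(expr, design):
    """Remove the least-squares fit of `design` (arrays x covariates) from each gene."""
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    return expr - (design @ coef).T

def surrogate_variables(expr, design, n_sv):
    """Simplified Step 1: top right singular vectors of the residual matrix.

    Assumption: n_sv is chosen by the caller instead of by a statistical test.
    """
    resid = residual_matrix(expr, design)
    _, _, vt = np.linalg.svd(resid, full_matrices=False)
    return vt[:n_sv].T  # arrays x n_sv

def fit_with_surrogates(expr, design, sv):
    """Step 4: per-gene regression on the design plus surrogate variables."""
    full = np.column_stack([design, sv])
    coef, *_ = np.linalg.lstsq(full, expr.T, rcond=None)
    return coef.T  # genes x (covariates + n_sv), gene-specific coefficients

# Simulated data with assumed effect sizes: 1,000 genes on 20 arrays.
rng = np.random.default_rng(1)
n_genes, n_arrays = 1000, 20
group = np.repeat([0.0, 1.0], n_arrays // 2)   # primary variable
factor = rng.normal(size=n_arrays)             # unmodeled factor
b_group = np.zeros(n_genes)
b_group[:300] = rng.normal(size=300)
b_factor = np.zeros(n_genes)
b_factor[200:500] = rng.normal(size=300)
expr = (np.outer(b_group, group) + np.outer(b_factor, factor)
        + rng.normal(size=(n_genes, n_arrays)))

design = np.column_stack([np.ones(n_arrays), group])  # intercept + group
sv = surrogate_variables(expr, design, n_sv=1)
coef = fit_with_surrogates(expr, design, sv)
print(sv.shape, coef.shape)
```

Because the surrogate variable is computed from residuals, it is orthogonal to the intercept and group columns by construction. This is a limitation of the shortcut: it can only recover the part of an unmodeled factor that is uncorrelated with the primary variable, which is one reason the full algorithm goes back to the original expression data in Step 3.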

