Statistical significance of variables driving systematic variation in high-dimensional data.

Chung NC, Storey JD - Bioinformatics (2014)

Bottom Line: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype.


Affiliation: Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.

Fig. 1. Illustration of systematic variation in genomic data due to latent variables. Complex biological variables, such as clinical subtypes and cell-cycle regulation, may be difficult to define, measure, or model. Instead, we can characterize the manifestation of latent variables, L, directly from high-dimensional genomic data, Y, using PCA and related methods. The proposed method calculates the statistical significance of associations between variables in Y and estimates of L, while accounting for the over-fitting that arises because L must be estimated from Y.
© Copyright Policy - creative-commons


Mentions: Latent variable models play an important role in understanding variation in genomic data (Leek and Storey, 2007; Price et al., 2006). They are particularly useful for characterizing systematic variation in genomic data whose variable representation is unobserved or imprecisely known (Fig. 1). Principal component analysis (PCA) has proven to be an especially informative method for capturing quantitative signatures of latent variables in genomic data, and it is in widespread use across a range of applications. For example, PCA has been successfully applied to uncover the systematic variation in gene expression (Alter et al., 2000; Holter et al., 2000; Raychaudhuri et al., 2000), estimate structure in population genetics (Price et al., 2006; Zhu et al., 2002), and account for dependence in multiple hypothesis testing (Leek and Storey, 2007, 2008). Generally, principal components (PCs) can be thought of as estimates of the unobserved manifestations of latent variables; they are constructed by aggregating variation across thousands or more genomic variables (Jolliffe, 2002). What is missing from this highly successful system is a method to precisely identify which genomic variables are the statistically significant drivers of the PCs in genomic data, which in turn identifies the genomic variables associated with the unobserved latent variables.
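The over-fitting problem mentioned above (PCs are estimated from the same data Y that is then tested against them) can be illustrated with a small simulation. The following is only a minimal sketch of a jackstraw-style resampling idea, not the authors' implementation: a few rows of Y are permuted to break their link to the latent variable, the PCs are re-estimated, and the association statistics of the permuted rows serve as an empirical null. The function names (`top_pcs`, `f_stats`, `jackstraw_sketch`), the parameters `r`, `s`, `B`, the simulated data, and the +1 correction in the empirical P-value are all assumptions made for illustration.

```python
import numpy as np

def top_pcs(Y, r):
    """Top r PCs (right singular vectors) of row-centered Y (m variables x n samples)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r].T  # shape (n, r)

def f_stats(Y, pcs):
    """F-statistic per row of Y: regression on the PCs versus an intercept-only model."""
    n, r = pcs.shape
    X = np.column_stack([np.ones(n), pcs])
    Yc = Y - Y.mean(axis=1, keepdims=True)
    rss0 = (Yc ** 2).sum(axis=1)                    # intercept-only residual sum of squares
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    rss1 = ((Y.T - X @ beta) ** 2).sum(axis=0)      # residual sum of squares with PCs
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))

def jackstraw_sketch(Y, r=1, s=10, B=50, seed=0):
    """Empirical P-values for association between each row of Y and the top r PCs."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))
    null = []
    for _ in range(B):
        Ystar = Y.copy()
        idx = rng.choice(m, size=s, replace=False)  # rows turned into synthetic nulls
        for i in idx:
            Ystar[i] = rng.permutation(Y[i])
        # PCs are re-estimated from the partly permuted data, so the null
        # statistics carry the same over-fitting as the observed ones.
        null.append(f_stats(Ystar[idx], top_pcs(Ystar, r)))
    null = np.sort(np.concatenate(null))
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)             # +1 correction is an assumption

# Simulated example: 200 variables, 20 samples, first 20 rows driven by a latent variable.
rng = np.random.default_rng(1)
latent = rng.normal(size=20)
Y = rng.normal(size=(200, 20))
Y[:20] += 2.0 * latent
pvals = jackstraw_sketch(Y, r=1, s=10, B=50, seed=0)
```

In this toy setting, rows driven by the simulated latent variable should tend to receive smaller P-values than pure-noise rows. The choice of `s` trades off how many null statistics each iteration yields against how much the permuted rows disturb the estimated PCs.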

