Statistical significance of variables driving systematic variation in high-dimensional data.
Bottom Line: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. An R software package, called jackstraw, is available in CRAN.
Affiliation: Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.
Mentions: For a given simulated dataset, we tested for associations between the observed variables and the latent variables by forming association statistics between the observed variables and their collective PC (r = 1). We calculated P values using both the conventional F test and the proposed method with s = 50 synthetic variables (Fig. 5). Over 500 simulated datasets, the conventional F test resulted in 500 one-sided KS P values that exhibit a strong anti-conservative bias, with a double KS P value of (Supplementary Fig. S4, black points). Conversely, the proposed method calculates P values correctly by accounting for the over-fitted measurement error in PCA, with a double KS P value of 0.502 (Supplementary Fig. S4, orange points). In addition, a comparison of estimated versus true FDR demonstrates that the jackstraw method adjusts appropriately for over-fitting (Supplementary Fig. S5). Note that the classification of P values is based on the true association status from the population-level data-generating distribution of model (1), not on model (2) or on the observed loadings from the PCA.
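The resampling idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce the API of the jackstraw R package. The function names (`top_pcs`, `f_stats`, `jackstraw_pvalues`), the number of resampling iterations `B`, and the `+1` smoothing of the empirical P values are assumptions made for illustration. In each iteration, s rows are independently permuted to create synthetic null variables, the PCs are recomputed on the altered data, and the F statistics of the synthetic rows against the new PCs are pooled into an empirical null distribution.

```python
import numpy as np

def top_pcs(Y, r):
    """Top r right singular vectors (n x r) of the row-centered m x n matrix Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r].T

def f_stats(Y, V):
    """Per-row F statistic: intercept + PCs versus intercept-only model."""
    n, r = V.shape
    X = np.column_stack([np.ones(n), V])
    Yc = Y - Y.mean(axis=1, keepdims=True)
    rss0 = (Yc ** 2).sum(axis=1)                      # intercept-only residuals
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    rss1 = ((Y.T - X @ beta) ** 2).sum(axis=0)        # full-model residuals
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))

def jackstraw_pvalues(Y, r=1, s=50, B=20, seed=0):
    """Empirical P values for association of each row of Y with the top r PCs."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))
    null = []
    for _ in range(B):
        Ystar = Y.copy()
        rows = rng.choice(m, size=s, replace=False)
        for i in rows:
            # Permuting a row breaks its association with the latent structure.
            Ystar[i] = rng.permutation(Y[i])
        # PCs are re-estimated on the altered data, so the synthetic rows
        # carry the same over-fitting as the observed statistics.
        null.append(f_stats(Ystar[rows], top_pcs(Ystar, r)))
    null = np.sort(np.concatenate(null))
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    # The +1 terms keep P values strictly positive (a choice of this sketch).
    return (n_ge + 1) / (null.size + 1)
```

Because the synthetic rows contribute to the recomputed PCs, their statistics reflect the same over-fitting as the observed ones, which is what a plain F test ignores. Keeping s small relative to the number of variables leaves the estimated PCs largely intact.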