Statistical significance of variables driving systematic variation in high-dimensional data.

Chung NC, Storey JD - Bioinformatics (2014)

Bottom Line: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype.


Affiliation: Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.

Fig. 5: Evaluation of significance measures of associations between variables and their PCs by comparing true P values and the Uniform(0,1) distribution. (a) The conventional F-test results in anti-conservative P values, as demonstrated by P values being skewed towards 0. (b) The proposed method produces P values distributed Uniform(0,1). The dashed line shows the Uniform(0,1) density function.

Mentions: For a given simulated dataset, we tested for associations between the observed variables and the latent variables by forming association statistics between the observed variables and their collective PC (r = 1). We calculated P values using both the conventional F test and the proposed method with s = 50 synthetic variables (Fig. 5). Over 500 simulated datasets, the conventional F test resulted in 500 one-sided KS P values that exhibit a strong anti-conservative bias with a double KS P value of (Supplementary Fig. S4, black points). Conversely, the proposed method correctly calculates P values by accounting for the over-fitted measurement error in PCA, with a double KS P value of 0.502 (Supplementary Fig. S4, orange points). Alternatively, a comparison of estimated versus true FDR demonstrates an appropriate adjustment for over-fitting in the jackstraw method (Supplementary Fig. S5). Note that the classification of P values is based on the true association status from the population-level data generating distribution from model (1), not on model (2) or on the observed loadings from the PCA.
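The excerpt above does not spell out the resampling steps, so the following Python sketch fills them in under stated assumptions: in each of B rounds, s randomly chosen variables are permuted independently to create synthetic nulls, PCA is recomputed on the altered matrix, and the F statistics of the synthetic nulls against the new PC form the empirical null distribution. The function names (`top_pc`, `f_stats`, `jackstraw_pvalues`), the number of rounds `B`, the `+1` pseudo-count in the empirical P value, and the toy simulation are illustrative choices, not details taken from the paper.

```python
import numpy as np

def top_pc(Y, r=1):
    """Top r principal components (n x r) of the row-centered m x n matrix Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r].T

def f_stats(Y, pcs):
    """Per-row F statistic: regression on the PCs plus intercept vs. intercept only."""
    n, r = pcs.shape
    X = np.column_stack([np.ones(n), pcs])
    rss0 = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    rss1 = ((Y.T - X @ beta) ** 2).sum(axis=0)
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))

def jackstraw_pvalues(Y, r=1, s=50, B=50, seed=0):
    """Empirical P values for association between each row of Y and its top r PCs."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pc(Y, r))
    null = np.empty((B, s))
    for b in range(B):
        rows = rng.choice(m, size=s, replace=False)
        Ystar = Y.copy()
        # Permute each chosen row independently to break its link to the latent variable.
        Ystar[rows] = rng.permuted(Y[rows], axis=1)
        # Recompute PCA so the synthetic nulls share the over-fitting of real variables.
        null[b] = f_stats(Ystar[rows], top_pc(Ystar, r))
    null = np.sort(null.ravel())
    # Number of null statistics greater than or equal to each observed statistic.
    exceed = null.size - np.searchsorted(null, observed, side="left")
    return (exceed + 1) / (null.size + 1)

# Toy data (assumed model): the first m1 of m variables depend on one latent variable.
rng = np.random.default_rng(1)
m, n, m1 = 500, 20, 50
latent = rng.standard_normal(n)
Y = rng.standard_normal((m, n))
Y[:m1] += np.outer(rng.uniform(1.0, 2.0, size=m1), latent)

p = jackstraw_pvalues(Y, r=1, s=50, B=50)
print("median P, associated variables:", np.median(p[:m1]))
print("mean P, null variables:", p[m1:].mean())
```

Because the PC is re-estimated after each permutation, the synthetic null statistics carry the same over-fitting as the observed ones, which is why the P values of truly null variables are expected to be approximately Uniform(0,1) in this sketch, whereas referring the same statistics to an F distribution would not account for it.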

