Statistical significance of variables driving systematic variation in high-dimensional data.

Chung NC, Storey JD - Bioinformatics (2014)

Bottom Line: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype.


Affiliation: Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.

Fig. 4: Sixteen simulation scenarios generated by combining four design factors. To assess the statistical accuracy of the conventional F-test and the proposed method, we simulated 500 independent studies for each scenario, and assessed statistical accuracy according to the "joint null criterion" (Leek and Storey, 2011). For the scenarios, non-null coefficients were set to either -1 or 1 with a probability of 0.5. For a given simulation study, a valid statistical testing procedure must yield a set of null P values that are jointly distributed Uniform(0,1). We use a KS test to identify deviations from the Uniform(0,1) distribution. Supplementary Material, Figure S3 provides a detailed overview of the evaluation pipeline.
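The uniformity check described in the caption can be illustrated with a small sketch. This is not the authors' evaluation pipeline (that is in Supplementary Figure S3, which is not reproduced here); it is a plain one-sample Kolmogorov-Smirnov test against Uniform(0,1), written with the Python standard library only. The function name `ks_uniform`, the asymptotic P value approximation, and the toy inputs are all assumptions made for illustration.

```python
import math
import random


def ks_uniform(pvalues):
    """One-sample KS test against Uniform(0,1).

    Returns (D, p), where D is the KS statistic and p is an
    asymptotic approximation to its P value.
    """
    x = sorted(pvalues)
    n = len(x)
    # Largest gap between the empirical CDF and the Uniform(0,1) CDF F(v) = v.
    d_plus = max((i + 1) / n - v for i, v in enumerate(x))
    d_minus = max(v - i / n for i, v in enumerate(x))
    d = max(d_plus, d_minus)
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0  # the series below converges slowly here; p is ~1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


rng = random.Random(1)
# Toy stand-in for null P values from a valid procedure.
uniform_p = [rng.random() for _ in range(950)]
# Toy stand-in for an anti-conservative procedure: P values pushed toward 0.
skewed_p = [u * u for u in uniform_p]

d_unif, p_unif = ks_uniform(uniform_p)
d_skew, p_skew = ks_uniform(skewed_p)
print(f"uniform: D={d_unif:.3f}, p={p_unif:.3g}")
print(f"skewed:  D={d_skew:.3f}, p={p_skew:.3g}")
```

In a setting like the one in the caption, a small KS P value for the null P values of a simulated study would flag a departure from Uniform(0,1); how such per-study results are aggregated across the 500 studies is left to the source's supplementary material.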
© Copyright Policy - creative-commons

Mentions: We constructed 16 simulation scenarios representing a wide range of configurations of signal and noise (Fig. 4), with 500 independent studies simulated from each. Let us first consider one of the simpler scenarios in detail. Model (1) is used to generate the data. In this particular scenario, we have $m = 1000$, $n = 20$, $r = 1$ and $L = \sqrt{\frac{n-1}{n}}\,(1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1)$, a dichotomous mean shift resembling differential expression between the first 10 observations and the second 10 observations. (The factor $\sqrt{(n-1)/n}$ is to give $L$ unit variance.) For 95% of the variables, we set $b_i = 0$, implying they are null variables; we parameterize this proportion by $\pi_0 = 0.95$. The other 50 non-null variables were simulated such that $b_i \sim \text{Uniform}(0,1)$. The noise terms are simulated as $e_{ij} \sim \text{Normal}(0,1)$. The data for variable $i$ are thus simulated according to $y_i = b_i L + e_i$.
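A minimal sketch of how one study from this scenario could be simulated is given below. It assumes the data model is y_i = b_i L + e_i with independent Normal(0,1) noise, and that the scaling factor on L is sqrt((n-1)/n), which is what gives a ±1 vector unit sample variance. The function name `simulate_study`, the seed, and the choice to place the null variables first are illustrative assumptions, not details from the source.

```python
import math
import random


def simulate_study(m=1000, n=20, pi0=0.95, seed=0):
    """Simulate one study with m variables and n observations.

    Assumed model: y_i = b_i * L + e_i, with e_ij ~ Normal(0,1).
    """
    rng = random.Random(seed)
    half = n // 2
    scale = math.sqrt((n - 1) / n)  # makes the sample variance of L equal 1
    # Dichotomous mean shift: first half positive, second half negative.
    L = [scale] * half + [-scale] * (n - half)
    m_null = round(pi0 * m)  # number of null variables (b_i = 0)
    # Null variables are listed first purely for convenience.
    b = [0.0] * m_null + [rng.random() for _ in range(m - m_null)]
    Y = [[b_i * l_j + rng.gauss(0.0, 1.0) for l_j in L] for b_i in b]
    return Y, L, b


Y, L, b = simulate_study()

# Check the unit-variance claim for L (sample variance, denominator n - 1).
mean_L = sum(L) / len(L)
var_L = sum((v - mean_L) ** 2 for v in L) / (len(L) - 1)
print(len(Y), len(Y[0]), round(var_L, 6))
```

With the defaults this yields a 1000 x 20 data matrix in which 950 rows are pure noise and 50 rows carry a mean shift of size b_i between the two groups of 10 observations.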

