Statistical significance of variables driving systematic variation in high-dimensional data.

Chung NC, Storey JD - Bioinformatics (2014)

Bottom Line: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype.


Affiliation: Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.

Fig. 3. A schematic of the general steps of the proposed algorithm for calculating the statistical significance of associations between variables (rows in Y) and their top r PCs. By independently permuting a small number (s) of variables and recalculating the PCs, we generate tractable “synthetic” null variables while preserving the overall systematic variation. Association statistics between the s synthetic null variables in the permuted matrix and the recalculated PCs form the empirical null distribution, automatically taking into account the over-fitting intrinsic to testing for associations between a set of observed variables and their PCs.

Mentions: We have developed a resampling method (Fig. 3) to obtain accurate statistical significance measures of the associations between observed variables and their PCs, accounting for the over-fitting that arises because the PCs are computed from the same set of observed variables. The proposed algorithm replaces a small number s of observed variables with independently permuted ‘synthetic’ variables, while preserving the overall systematic variation in the data. Note that the jackstraw disrupts the systematic variation only among the randomly chosen s rows, by applying independently generated permutation mappings to them. The resulting matrix is simply the original matrix Y with those s rows replaced by independently permuted versions. On each such permutation dataset, we recalculate the PCs and compute an association statistic for each synthetic variable, exactly as was done on the original data. We carry this out B times, effectively creating B sets of permutation statistics. The association statistics calculated on Y are then compared to the association statistics calculated on only the s synthetic rows of each permutation dataset to obtain statistical significance measures.
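The resampling procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`top_pcs`, `f_stats`, `jackstraw_pvalues`), the use of an F-statistic as the association statistic, the row-centering before the SVD, and the +1 correction in the empirical p-value are all choices made here for the example.

```python
import numpy as np

def top_pcs(Y, r):
    """Top r PCs (right singular vectors) of the row-centered m x n matrix Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r].T  # shape (n, r)

def f_stats(Y, V):
    """Per-row F-statistic comparing 'intercept + PCs' against 'intercept only'.

    The F-statistic is an assumption of this sketch; the text above only
    speaks of generic association statistics.
    """
    n, r = V.shape
    X = np.column_stack([np.ones(n), V])
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    rss_full = ((Y.T - X @ beta) ** 2).sum(axis=0)
    rss_null = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return ((rss_null - rss_full) / r) / (rss_full / (n - r - 1))

def jackstraw_pvalues(Y, r=1, s=10, B=100, seed=0):
    """Empirical p-values for association between each row of Y and its top r PCs."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))

    null = np.empty(B * s)
    for b in range(B):
        # Pick s rows at random and permute each one independently.
        rows = rng.choice(m, size=s, replace=False)
        Y_star = Y.copy()
        for i in rows:
            Y_star[i] = rng.permutation(Y[i])
        # Recompute the PCs on the partly permuted matrix, then keep the
        # statistics of the s synthetic rows only.
        V_star = top_pcs(Y_star, r)
        null[b * s:(b + 1) * s] = f_stats(Y_star[rows], V_star)

    # Fraction of pooled null statistics at least as large as each observed one.
    # The +1 correction (an assumption here) keeps p-values strictly positive.
    null.sort()
    n_exceed = null.size - np.searchsorted(null, observed, side="left")
    return (n_exceed + 1) / (null.size + 1)
```

Because only s of the m rows are permuted in each round, the recomputed PCs stay close to the original ones, while the synthetic rows still go through the same "fit PCs, then test against them" pipeline as the observed rows. That is what lets the pooled statistics from the synthetic rows serve as a null distribution that reflects the over-fitting.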

