Population structure and eigenanalysis.

Patterson N, Price AL, Reich D - PLoS Genet. (2006)

Bottom Line: Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. This means that we can predict the dataset size needed to detect structure.


Affiliation: Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.

ABSTRACT
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general "phase change" phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.


pgen-0020190-g003: Testing the Fit of the Second Eigenvalue. We generated genotype data in which the leading eigenvalue is overwhelmingly significant (FST = 0.01, m = 100, n = 5,000) with two equal-sized subpopulations. We show P–P plots for the TW statistic computed from the second eigenvalue. The fit at the high end is excellent.

Mentions: It is very important to be able to answer the question: Do the data show evidence of additional population structure over and above what has already been detected? The test we propose is extremely simple. If our matrix X has eigenvalues λ1, …, λm and we have already declared the top k eigenvalues to be significant, then we simply test λk+1, …, λm as though X were an (m′ − k) × (m′ − k) Wishart matrix. Johnstone shows [22, Proposition 1.2] that this procedure is conservative, at least for a true Wishart matrix. We tested this by generating data in which there is one eigenvalue that is overwhelmingly significant, and examined the distribution of the second eigenvalue. As shown by the P–P plot of Figure 3, the fit is again very good, especially for small p-values. If an eigenvalue is not significant, then further testing of smaller eigenvalues should not be done.
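
The sequential procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The Tracy-Widom (TW) p-value is swapped for a Monte Carlo null built from Gaussian matrices, because NumPy has no TW distribution. The simulation uses a Balding-Nichols style allele-frequency model and a per-marker normalization, both of which are assumptions here. All function names are hypothetical, and the dataset is smaller than the one in the figure so that the example runs quickly.

```python
import numpy as np

def simulate_two_populations(m, n, fst, rng):
    """Genotypes (m individuals x n markers) for two equal-sized subpopulations.

    Assumption: Balding-Nichols style frequencies drawn around an ancestral
    frequency that is uniform on [0.1, 0.9]. m is assumed to be even.
    """
    p_anc = rng.uniform(0.1, 0.9, size=n)
    a = p_anc * (1.0 - fst) / fst
    b = (1.0 - p_anc) * (1.0 - fst) / fst
    freqs = rng.beta(a, b, size=(2, n))        # one row per subpopulation
    labels = np.repeat([0, 1], m // 2)         # subpopulation of each individual
    return rng.binomial(2, freqs[labels])      # allele counts in {0, 1, 2}

def sorted_eigenvalues(genotypes):
    """Eigenvalues (descending) of the normalized individual-by-individual matrix."""
    g = genotypes.astype(float)
    p = g.mean(axis=0) / 2.0
    keep = (p > 0.0) & (p < 1.0)               # drop monomorphic markers
    g, p = g[:, keep], p[keep]
    z = (g - 2.0 * p) / np.sqrt(p * (1.0 - p)) # assumed normalization
    eig = np.linalg.eigvalsh(z @ z.T / z.shape[1])[::-1]
    # Column centering removes one dimension, so only m - 1 eigenvalues are kept.
    return eig[:-1], z.shape[1]

def null_top_share(dim, n, n_sims, rng):
    """Monte Carlo null: largest eigenvalue as a share of the eigenvalue sum."""
    shares = np.empty(n_sims)
    for i in range(n_sims):
        a = rng.standard_normal((dim, n))
        eig = np.linalg.eigvalsh(a @ a.T / n)
        shares[i] = eig[-1] / eig.sum()
    return shares

def count_significant(eig, n, alpha=0.01, n_sims=200, rng=None):
    """Test eigenvalues in order and stop at the first non-significant one."""
    rng = np.random.default_rng() if rng is None else rng
    k = 0
    while k < len(eig) - 1:
        rest = eig[k:]                         # treat the rest as a smaller Wishart
        observed = rest[0] / rest.sum()
        null = null_top_share(len(rest), n, n_sims, rng)
        p_value = (1 + np.sum(null >= observed)) / (1 + n_sims)
        if p_value >= alpha:
            break                              # do not test smaller eigenvalues
        k += 1
    return k

rng = np.random.default_rng(0)
genotypes = simulate_two_populations(m=40, n=1000, fst=0.05, rng=rng)
eig, n_used = sorted_eigenvalues(genotypes)
n_significant = count_significant(eig, n_used, rng=rng)
print("significant eigenvalues:", n_significant)
```

The loop stops at the first eigenvalue that fails the test, mirroring the rule that smaller eigenvalues should not be tested once one is found non-significant. Using the top eigenvalue's share of the remaining eigenvalue sum makes the statistic independent of the overall scale of the matrix, which is a design choice of this sketch.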

