Population structure and eigenanalysis.

Patterson N, Price AL, Reich D - PLoS Genet. (2006)

Bottom Line: Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. This means that we can predict the dataset size needed to detect structure.

Affiliation: Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.

ABSTRACT
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general "phase change" phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.
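The last sentence can be made concrete with a small calculation. The sketch below assumes that, for two equal-size populations, the detection threshold scales as F_ST ≈ 1/sqrt(nm), with m individuals and n markers. That scaling is recalled from the full paper's discussion of the phase change and is not stated in this abstract, so treat it as an assumption; the function names are hypothetical.

```python
import math


def fst_threshold(m: int, n: int) -> float:
    """Assumed detection threshold F_ST ~ 1/sqrt(n*m) for m individuals and n markers."""
    return 1.0 / math.sqrt(n * m)


def markers_needed(m: int, fst: float) -> int:
    """Smallest marker count n with 1/sqrt(n*m) <= fst, i.e. n >= 1/(m * fst**2)."""
    return math.ceil(1.0 / (m * fst * fst))


# Dataset sizes used in the simulations described on this page.
for m, n in [(100, 5000), (200, 50000)]:
    print(f"m={m}, n={n}: assumed threshold F_ST ~ {fst_threshold(m, n):.5f}")
```

Under this assumption, doubling the number of individuals has the same effect on the threshold as doubling the number of markers, since only the product nm enters.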

Figure 2: Testing the Fit of the TW Distribution
(A) We carried out 1,000 simulations of a panmictic population, where we have a sample size of m = 100 and n = 5,000 unlinked markers. We give a P–P plot of the TW statistic against the theoretical distribution; this shows the empirical cumulative distribution against the theoretical cumulative distribution for a given quantile. If the fit is good, we expect the plot will lie along the line y = x. Interest is primarily at the top right, corresponding to low p-values.
(B) P–P plot corresponding to a sample size of m = 200 and n = 50,000 markers. The fit is again excellent, demonstrating the appropriateness of the Johnstone normalization.
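The construction of a P–P plot can be sketched in a few lines. This is a minimal illustration, not the authors' code: because the TW cumulative distribution is not available in the Python standard library, a standard normal is used as a stand-in for the theoretical distribution, and `pp_points` and `normal_cdf` are hypothetical helper names.

```python
import math
import random


def pp_points(samples, cdf):
    """Return (theoretical, empirical) cumulative probabilities, one pair per sample."""
    xs = sorted(samples)
    k = len(xs)
    # The i-th smallest sample has empirical cumulative probability (i + 1) / k.
    return [(cdf(x), (i + 1) / k) for i, x in enumerate(xs)]


def normal_cdf(x):
    """Standard normal CDF; a stand-in for the TW CDF in this sketch."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(1000)]  # 1,000 simulated statistics
points = pp_points(draws, normal_cdf)

# A good fit keeps every point close to the line y = x.
max_gap = max(abs(t - e) for t, e in points)
print(f"largest deviation from y = x: {max_gap:.3f}")
```

Plotting the pairs in `points` gives the P–P plot; the top-right corner holds the largest statistics, which correspond to the low p-values.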

Mentions: We first ran a series of simulations in the absence of population structure. (Some additional details are in the Methods section.) Our first set of runs had 100 individuals and 5,000 unlinked SNPs, and the second had 200 individuals and 50,000 unlinked SNPs. In each case we ran 1,000 simulations, and in Figures 2A and 2B we show probability–probability (P–P) plots of the empirical and TW tail areas. The results seem entirely satisfactory, especially for low p-values in the top right of Figures 2A and 2B. For assessment of statistical significance, it is the low p-value range that is relevant. The simulations show more generally that the TW theory is relevant in a genetic context, that the normalizations of Equations 5–7 are appropriate, and that the calculation of the effective marker size has not seriously distorted the TW statistic.
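One run of such a null simulation might look like the sketch below. It rests on several assumptions not given on this page: allele frequencies are drawn uniformly from (0.1, 0.9); the centering and scaling formulas are a recollection of the Johnstone-style normalization (Equations 5–7 are cited here but not reproduced); the nominal marker count n is used in place of the effective marker size; and the top eigenvalue is approximated by power iteration. All function names are hypothetical, and the sizes are much smaller than in the paper so that pure Python stays fast.

```python
import math
import random


def simulate_genotypes(m, n, rng):
    """Return n marker columns (each of length m) for one panmictic population."""
    cols = []
    while len(cols) < n:
        p = rng.uniform(0.1, 0.9)  # assumed allele-frequency range
        col = [int(rng.random() < p) + int(rng.random() < p) for _ in range(m)]
        if 0 < sum(col) < 2 * m:  # drop monomorphic markers
            cols.append(col)
    return cols


def normalize(col):
    """Center a marker and scale by sqrt(p(1-p)), with p the estimated allele frequency."""
    mean = sum(col) / len(col)
    p = mean / 2.0
    scale = math.sqrt(p * (1.0 - p))
    return [(c - mean) / scale for c in col]


def largest_eigenvalue(X, rng, iters=300):
    """Approximate the top eigenvalue of a symmetric PSD matrix by power iteration."""
    m = len(X)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(iters):
        w = [sum(X[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(X[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(v[i] * w[i] for i in range(m))  # Rayleigh quotient of unit vector v


def tw_statistic(m, n, seed=0):
    """Return (scaled top eigenvalue, normalized statistic) for one null simulation."""
    rng = random.Random(seed)
    cols = [normalize(c) for c in simulate_genotypes(m, n, rng)]
    # X = M M^T / n, an m x m matrix over individuals.
    X = [[sum(c[i] * c[j] for c in cols) / n for j in range(m)] for i in range(m)]
    trace = sum(X[i][i] for i in range(m))
    m1 = m - 1  # centering each marker removes one degree of freedom
    lead = m1 * largest_eigenvalue(X, rng) / trace
    root = math.sqrt(n - 1) + math.sqrt(m1)
    mu = root ** 2 / n  # assumed centering constant
    sigma = (root / n) * (1.0 / math.sqrt(n - 1) + 1.0 / math.sqrt(m1)) ** (1.0 / 3.0)
    return lead, (lead - mu) / sigma


lead, x = tw_statistic(m=20, n=300, seed=1)
print(f"scaled top eigenvalue = {lead:.3f}, normalized statistic = {x:.3f}")
```

Repeating `tw_statistic` over many seeds and comparing the resulting statistics with the TW cumulative distribution would give the kind of P–P plot described above.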

