Population structure and eigenanalysis.
Bottom Line:
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation.We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests.This means that we can predict the dataset size needed to detect structure.
View Article:
PubMed Central - PubMed
Affiliation: Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.
ABSTRACT
Show MeSH
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general "phase change" phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure. |
Related In:
Results -
Collection
getmorefigures.php?uid=PMC1713260&req=5
Mentions: We generated simulated data, by taking a trifurcation between populations (A,B,D) 100 generations ago. Population C is a recent admixture of A and B. The mixing proportion of A in an individual from C is Beta-distributed B(3.5,1.5) so that the average contribution of population A in an individual of population C is .7 (see Figure 7). Effective population sizes are 10,000 for each population. We then simulated data for 10,000 unlinked markers (more details are in the Methods section). FST between any pair of A,B,D is .005. We are attempting to mimic the data of Figure 5, and chose to run our analysis on simulated samples from populations B,C,D, not using samples from A. We expect two significant eigenvalues corresponding to the splits of populations B,C, and D. If population A is included in the analysis, we also get just two significant eigenvalues, as predicted by theory. This is what is observed (unpublished data), with, as predicted, the admixed population not adding to the number of axes of variation (the third eigenvalue is not significant). In Figure 8 we show a plot of the first two eigenvectors. Note the dispersion of population C along a line. This is diagnostic of admixture. The resemblance of Figures 5 and 8 is striking. |
View Article: PubMed Central - PubMed
Affiliation: Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.