StructHDP: automatic inference of number of clusters and population structure from admixed genotype data.

Shringarpure S, Won D, Xing EP - Bioinformatics (2011)

Bottom Line: We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset.


Affiliation: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. suyash@cs.cmu.edu

ABSTRACT

Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user.

Methods: We present StructHDP, which is a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process (HDP), we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data.

Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We used StructHDP to analyze a dataset of 155 Taita thrush, Turdus helleri, which has been previously analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from the Human Genome Diversity Project from 53 world populations. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset.

Availability: StructHDP is written in C++. The code will be available for download at http://www.sailing.cs.cmu.edu/structhdp.

Contact: suyash@cs.cmu.edu; epxing@cs.cmu.edu.


© Copyright Policy - creative-commons

Figure 9: The distribution of the Mantel correlation statistic for the pairwise Euclidean distance matrix and the pairwise Fst distance matrix. The stem indicates the observed value of the statistic. The result is significant, with the associated P=0.0025.

Mentions: We hypothesized that if the inferred ancestry proportions capture the genetic variation between and across populations, then the pairwise Euclidean distance computed earlier should be correlated with genetic distance. To test this hypothesis, we computed the pairwise Fst distance between the seven continental divisions of the data and used a Mantel test to check for correlation between the pairwise Euclidean distance matrix and the pairwise Fst distance matrix. A Mantel test evaluates the alternative hypothesis of correlation between two matrices against the null hypothesis of no correlation by permuting the rows and columns of one of the matrices and observing the distribution of the correlation statistic. The Mantel test shows that the correlation between the two distance matrices is 0.57 (P=0.0025 with 10 000 replicates). The distribution of the observed and simulated Mantel correlation statistic is shown in Figure 9. Thus, the Euclidean distance and Fst distance are strongly positively correlated, which supports the inferred population structure.
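The permutation procedure described above can be illustrated with a minimal sketch. This is not the authors' code: the function name mantel_test, the use of Pearson correlation, the one-sided comparison, and the +1 correction in the P-value are all assumptions made for illustration, and the toy matrices are randomly generated rather than taken from the paper.

```python
import numpy as np

def mantel_test(d1, d2, n_perm=10_000, seed=0):
    """One-sided Mantel permutation test (illustrative sketch).

    d1, d2: symmetric (n x n) distance matrices over the same n groups.
    Returns the observed Pearson correlation of the upper triangles and
    a permutation P-value for positive correlation.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)          # off-diagonal upper triangle
    x = d1[iu]
    r_obs = np.corrcoef(x, d2[iu])[0, 1]

    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute rows and columns of d2 together so it stays a valid
        # distance matrix over relabelled groups.
        d2_perm = d2[np.ix_(p, p)]
        if np.corrcoef(x, d2_perm[iu])[0, 1] >= r_obs:
            count += 1
    # Assumption: count the observed arrangement as one permutation.
    p_value = (count + 1) / (n_perm + 1)
    return r_obs, p_value

# Toy data only: 7 hypothetical groups with random "ancestry proportions".
toy_rng = np.random.default_rng(42)
toy_props = toy_rng.dirichlet(np.ones(4), size=7)
toy_euclid = np.linalg.norm(toy_props[:, None, :] - toy_props[None, :, :], axis=-1)

# A second, noisy distance matrix standing in for a genetic distance.
noise = toy_rng.normal(scale=0.05, size=toy_euclid.shape)
toy_other = np.abs(toy_euclid + (noise + noise.T) / 2)
np.fill_diagonal(toy_other, 0.0)

r, p_val = mantel_test(toy_euclid, toy_other, n_perm=2000)
print(f"Mantel r = {r:.3f}, P = {p_val:.4f}")
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent; shuffling individual entries would destroy the matrix structure and make the reference distribution invalid.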

