StructHDP: automatic inference of number of clusters and population structure from admixed genotype data.

Shringarpure S, Won D, Xing EP - Bioinformatics (2011)

Bottom Line: We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset.


Affiliation: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. suyash@cs.cmu.edu

ABSTRACT

Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user.

Methods: We present StructHDP, which is a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process (HDP), we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data.
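To make the role of the HDP more concrete, the following is a minimal Python sketch of a truncated stick-breaking approximation to a Hierarchical Dirichlet Process. It is not the StructHDP implementation (which is written in C++ and performs inference with a Gibbs sampler); the function names, the truncation level `k_max`, and the concentration values `gamma` and `alpha` are assumptions chosen only for illustration. The sketch shows how every individual can draw its own ancestry proportions while sharing one global set of ancestral populations, without the number of populations being fixed in advance beyond the truncation.

```python
import random


def global_weights(gamma, k_max, rng):
    """Truncated stick-breaking draw of the shared population weights."""
    weights, remaining = [], 1.0
    for _ in range(k_max - 1):
        frac = rng.betavariate(1.0, gamma)  # fraction of the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # leftover mass goes to the last component
    return weights


def individual_proportions(alpha, beta, rng):
    """Draw one individual's proportions from Dirichlet(alpha * beta)."""
    # A Dirichlet draw is a vector of normalized gamma variates; the tiny
    # floor keeps the shape parameter strictly positive.
    draws = [rng.gammavariate(max(alpha * b, 1e-12), 1.0) for b in beta]
    total = sum(draws)
    if total == 0.0:  # numerical underflow guard: fall back to global weights
        return list(beta)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
beta = global_weights(gamma=1.0, k_max=10, rng=rng)
props = [individual_proportions(alpha=5.0, beta=beta, rng=rng) for _ in range(3)]
```

In this sketch a small `gamma` concentrates the global weights on a few populations, and a large `alpha` keeps each individual's proportions close to the global weights. Inferring these quantities from genotype data, as StructHDP does, requires a sampler that is not shown here.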

Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We used StructHDP to analyze a dataset of 155 Taita thrush, Turdus helleri, which has been previously analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from the Human Genome Diversity Project from 53 world populations. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset.

Availability: StructHDP is written in C++. The code will be available for download at http://www.sailing.cs.cmu.edu/structhdp.

Contact: suyash@cs.cmu.edu; epxing@cs.cmu.edu.


Figure 7: The ancestry proportions for the 1048 individuals from the Human Genome Diversity Project inferred by StructHDP. Each thin line denotes the ancestry proportions for a single individual. Different colors correspond to different ancestral populations. Dark black lines separate individuals from different major geographical divisions. The geographical divisions are indicated by the labels on top of the graph.

Mentions: To analyze these results further, we plotted the ancestry proportions of the 1048 individuals as a bar graph, where every individual is represented by a thin bar with four components which sum to 1. Figure 7 shows the resulting bar graph. We can see that the clusters obtained correspond to the major geographical divisions of the world and the ancestral populations can be roughly described as ancestral African (denoted by green color), ancestral American-East Asian (blue), ancestral European (yellow) and ancestral Oceanian (red). From the ancestry proportions, we can see that the modern East Asian populations and American populations are similar, with the modern East Asian populations having a larger contribution from the ancestral population corresponding to Europe. Modern Asian populations also show some Oceanic ancestry (from the ancestral population denoted by red color). Modern Central and South Asian populations show an admixture of European and East Asian ancestral populations. The Middle Eastern populations show contributions from the ancestral African population and the ancestral European population. Modern Oceanic populations are an admixture of an ancestral Oceanic population with an ancestral East Asian population. All these observations are in agreement with previous analyses of the data by Rosenberg et al. (2002) and other studies of regional populations. We should note that the clusters inferred by StructHDP are not identical to the ones observed by Rosenberg et al. for K=4. Rosenberg et al. observe that East Asia separates out into a separate cluster for K=4 while Oceania separates from the rest of the data only for values of K larger than 4.
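The bar-graph representation described above can be sketched as follows. This is a minimal Python illustration using made-up ancestry proportions, not values from the Human Genome Diversity Project analysis; the function names `stack_segments` and `dominant_label` and the ordering of the four labels are assumptions introduced here. Each individual's four proportions are turned into the stacked segment boundaries that a plotting routine would draw as one thin bar.

```python
# Rough descriptions of the four inferred ancestral populations; the
# ordering is an assumption made for this sketch.
LABELS = ["African", "American-East Asian", "European", "Oceanian"]


def stack_segments(proportions, tol=1e-9):
    """Return (bottom, top) boundaries for each component of one bar."""
    if abs(sum(proportions) - 1.0) > tol:
        raise ValueError("ancestry proportions must sum to 1")
    segments, bottom = [], 0.0
    for p in proportions:
        segments.append((bottom, bottom + p))
        bottom += p
    return segments


def dominant_label(proportions):
    """Name the ancestral population with the largest contribution."""
    best = max(range(len(proportions)), key=proportions.__getitem__)
    return LABELS[best]


# Hypothetical individual, for illustration only.
individual = [0.125, 0.5, 0.25, 0.125]
segments = stack_segments(individual)
```

Drawing one such bar per individual, colored by component and grouped by geographical division, yields a plot of the kind the caption of Figure 7 describes.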

