Inferring population structure and relationship using minimal independent evolutionary markers in Y-chromosome: a hybrid approach of recursive feature selection for hierarchical clustering.
Bottom Line: An analysis of 105 world-wide populations showed that 15 independent variations/markers were optimal for defining population structure parameters such as FST, molecular variance and correlation-based relationship. Subsequent addition of randomly selected markers had a negligible effect (close to zero, i.e. 1 × 10^-3) on these parameters. The study proves efficient in tracing complex population structures and deriving relationships among world-wide populations in a cost-effective and expedient manner.
Affiliation: National Centre of Applied Human Genetics, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
Mentions: To arrive at an optimal number of independent variables (evolutionary markers/SNPs) for resolving world-wide population structure and relationships, we applied a combined approach of feature selection and hierarchical clustering to prune variables on the human Y-chromosome (Figure 3). At each step, the optimization was validated by several computational simulations, such as comparison of PCA plots, evaluation and validation of population clusters, scrutiny of the purity of the resulting clusters, and comparison with existing feature selection methods. Population clustering was performed with three different methods, namely hierarchical clustering, K-medoids and K-means. The optimal cluster size for each population set was determined by inspecting the PCA plots of populations (Figure 4), followed by evaluation of the Dunn index (47) and connectivity (48) for all cluster sizes (3–7) with different sets of markers (Supplementary Figure S3a, b and c). The purity of clusters was then compared across marker sets for the most appropriate cluster size in each population set (Figure 5). Purity of clusters (Y-axis) as a function of the number of markers (X-axis) is shown in Figure 6a and b for sets of 50 and 79 populations, respectively. The population clustering ability of our methodology was also compared with two existing feature selection methods, information gain and χ2 (Table 1). These results formed the basis for systematically designing the multiplexes: a single multiplex accommodating the independent Y-chromosome evolutionary markers, followed by three continent-specific multiplexes for recently evolved populations.
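The two cluster-quality measures named above, purity and the Dunn index, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `cluster_purity` and `dunn_index`, the use of Euclidean distance, and the toy data are assumptions made for illustration only. Purity is computed as the fraction of items that fall in the majority reference class of their cluster, and the Dunn index as the smallest between-cluster distance divided by the largest within-cluster diameter.

```python
from collections import Counter

import numpy as np


def cluster_purity(cluster_labels, true_labels):
    """Fraction of items that belong to the majority true class of their cluster."""
    members_by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        members_by_cluster.setdefault(cluster, []).append(truth)
    # For each cluster, count how many members carry its most common true label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in members_by_cluster.values()
    )
    return majority_total / len(cluster_labels)


def dunn_index(points, cluster_labels):
    """Smallest between-cluster distance over largest within-cluster diameter.

    Assumes Euclidean distance, at least two clusters, and at least one
    cluster with two distinct points (otherwise the diameter is zero).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(cluster_labels)
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    same_cluster = labels[:, None] == labels[None, :]
    max_diameter = dist[same_cluster].max()
    min_separation = dist[~same_cluster].min()
    return min_separation / max_diameter


# Hypothetical toy data: four "populations" in a two-dimensional marker space.
toy_points = [[0, 0], [0, 1], [5, 0], [5, 1]]
toy_clusters = [0, 0, 1, 1]
toy_regions = ["A", "A", "B", "A"]  # made-up reference grouping

print(cluster_purity(toy_clusters, toy_regions))  # 0.75
print(dunn_index(toy_points, toy_clusters))       # 5.0
```

Higher values of both measures indicate better-separated, more homogeneous clusters, which is why they are useful for comparing cluster sizes and marker sets against one another.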