Inferring population structure and relationship using minimal independent evolutionary markers in Y-chromosome: a hybrid approach of recursive feature selection for hierarchical clustering.
Bottom Line: An analysis of 105 world-wide populations showed that 15 independent variations/markers were optimal in defining population structure parameters, such as FST, molecular variance and correlation-based relationship. A subsequent addition of randomly selected markers had a negligible effect (close to zero, i.e. 1 × 10⁻³) on these parameters. The study proves efficient in tracing complex population structures and deriving relationships among world-wide populations in a cost-effective and expedient manner.
Affiliation: National Centre of Applied Human Genetics, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
Mentions: The current study (Figure 2) addresses the problems of high dimensionality and expensive genotyping simultaneously. High dimensionality was addressed by selecting highly informative, independent Y-chromosomal markers (features) through a novel approach, ‘recursive feature selection for hierarchical clustering (RFSHC)’. The approach recursively selects features by ranking variables on Pearson's correlation coefficient (PCC), embedded within agglomerative (bottom-up) hierarchical clustering that draws on the phylogeny of Y-chromosomal haplogroups. It was first applied to a dataset of 50 populations, and the observations were then confirmed on two datasets of 79 and 105 populations. Several computational analyses, including principal component analysis (PCA) plots, cluster validation, cluster purity and comparison with existing feature selection methods, were performed to validate the approach. Further, to reduce cost as far as possible without compromising the ability to estimate population structure, these independent markers were combined into a single multiplex on a medium-throughput MALDI-TOF-MS platform, ‘SEQUENOM’. In addition, recently evolved haplogroups representing lower nodes in the Y-chromosome hierarchy were accommodated in three subsequent multiplexes in a continent-specific manner, to detect even minor changes in the resolution of population structure and relationship. Moreover, the newly designed multiplexes of highly informative, independent features were genotyped for two geographically independent Indian population groups (North India and East India), and the data were analyzed along with 105 world-wide populations (datasets of 50, 79 and 105 populations) for population structure parameters such as population differentiation (FST) and molecular variance.
The results illustrated that an optimal set of 15 independent Y-chromosomal markers was sufficient to infer population structure and relationship with resolution and precision equivalent to those obtained with a larger set of markers (Figure 2).
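The general idea of keeping only mutually independent markers and then clustering populations on them can be illustrated with a small sketch. This is not the authors' RFSHC procedure, which embeds the ranking inside the clustering and draws on the haplogroup phylogeny; it is a simplified stand-in that applies a greedy Pearson-correlation filter and then average-linkage agglomerative clustering. The marker names (M1 to M4), population names, frequencies, the variance-based ranking and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def spread(values):
    """Sum of squared deviations from the mean (used only to rank markers)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def select_independent_markers(freqs, threshold=0.9):
    """Greedy filter: visit markers from most to least variable across
    populations and keep one only if |r| with every kept marker is below
    the threshold. Ranking rule and threshold are assumptions."""
    ranked = sorted(freqs, key=lambda m: spread(freqs[m]), reverse=True)
    kept = []
    for marker in ranked:
        if all(abs(pearson(freqs[marker], freqs[k])) < threshold for k in kept):
            kept.append(marker)
    return kept


def cluster_populations(freqs, markers, pops, n_clusters):
    """Average-linkage agglomerative clustering of populations, using
    1 - Pearson r between marker-frequency profiles as the distance."""
    profiles = {p: [freqs[m][i] for m in markers] for i, p in enumerate(pops)}
    clusters = [[p] for p in pops]

    def dist(a, b):
        total = sum(1 - pearson(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (bottom-up).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: per-population frequencies of four markers.
pops = ["PopA", "PopB", "PopC", "PopD", "PopE", "PopF"]
freqs = {
    "M1": [0.8, 0.7, 0.1, 0.2, 0.1, 0.1],
    "M2": [0.4, 0.35, 0.05, 0.1, 0.05, 0.05],  # proportional to M1, so redundant
    "M3": [0.1, 0.1, 0.7, 0.6, 0.1, 0.2],
    "M4": [0.0, 0.1, 0.1, 0.1, 0.7, 0.6],
}

kept = select_independent_markers(freqs)
clusters = cluster_populations(freqs, kept, pops, n_clusters=3)
print("kept markers:", kept)
print("clusters:", clusters)
```

In this toy input, M2 carries the same information as M1 and is filtered out, while the remaining markers still separate the populations into three groups. A real analysis would also compare parameters such as FST and molecular variance before and after the reduction, as the study describes.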