A novel similarity-measure for the analysis of genetic data in complex phenotypes.

Lagani V, Montesanto A, Di Cianni F, Moreno V, Landi S, Conforti D, Rose G, Passarino G - BMC Bioinformatics (2009)

Bottom Line: The effectiveness of the Hardy-Weinberg kernel was compared to the performance of the well-established linear kernel. We found that the Hardy-Weinberg kernel significantly outperformed the linear kernel in a number of experiments where we used either simulated data or real data. The ability to capture the effect of rare genotypes on phenotypic traits might be a very important and useful feature, as most current statistical tools lose much of their statistical power when rare genotypes are involved in the susceptibility to the trait under study.


Affiliation: Department of Electronics, Informatics and Systems, University of Calabria, Via Ponte Pietro Bucci 41C, 87036, Rende, Italy. vlagani@deis.unical.it

ABSTRACT

Background: Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for processing these data effectively has not kept pace. In particular, the Machine Learning literature includes relatively few papers focused on the development and application of data mining methods for the analysis of genetic variability. Moreover, these papers apply to genetic data procedures that were developed for a different kind of analysis and do not take into account the peculiarities of population genetics. The aim of our study was to define a new similarity measure, specifically conceived for measuring the similarity between the genetic profiles of two groups of subjects (i.e., cases and controls), taking into account that genetic profiles are usually distributed in a population group according to the Hardy-Weinberg equilibrium.
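For readers unfamiliar with the Hardy-Weinberg equilibrium mentioned above, the following is a minimal sketch of the standard expected genotype proportions at a biallelic locus (p², 2pq, q²). The function names and the genotype labels "AA", "Aa" and "aa" are illustrative assumptions, not taken from the paper.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Estimate the frequency p of allele A from observed genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    if total_alleles == 0:
        raise ValueError("at least one genotyped subject is required")
    # Each AA subject carries two copies of A, each Aa subject carries one.
    return (2 * n_AA + n_Aa) / total_alleles


def hwe_genotype_frequencies(p):
    """Expected genotype frequencies at Hardy-Weinberg equilibrium.

    p is the frequency of allele A; q = 1 - p is the frequency of allele a.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


if __name__ == "__main__":
    p = allele_frequency(81, 18, 1)
    print(p, hwe_genotype_frequencies(p))
```

With genotype counts of 81, 18 and 1, the estimated allele frequency is p = 0.9 and the expected proportions are approximately 0.81, 0.18 and 0.01, which shows how rare the minor homozygote can be even for a moderately common allele.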

Results: We set up a new kernel function consisting of a similarity measure between groups of subjects genotyped for numerous genetic loci. This measure weights different genetic profiles according to the estimates of gene frequencies at Hardy-Weinberg equilibrium in the population. We named this function the "Hardy-Weinberg kernel". The effectiveness of the Hardy-Weinberg kernel was compared to the performance of the well-established linear kernel. We found that the Hardy-Weinberg kernel significantly outperformed the linear kernel in a number of experiments where we used either simulated data or real data.
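The abstract does not give the formula of the Hardy-Weinberg kernel, so the sketch below is not the published definition. It only illustrates the general idea of weighting shared genotypes by their expected frequency, next to a plain linear kernel. The 0/1/2 genotype coding (copies of the minor allele), the weight 1 - f(g) and all function names are assumptions made for illustration.

```python
def hwe_frequencies(q):
    """Expected HWE frequencies for 0, 1 and 2 copies of a minor allele
    whose population frequency is q."""
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def linear_kernel(x, y):
    """Ordinary dot product between two genotype vectors coded 0/1/2."""
    return sum(a * b for a, b in zip(x, y))


def rarity_weighted_kernel(x, y, minor_allele_freqs):
    """Hypothetical similarity: a genotype shared at a locus contributes
    1 - f(g), where f(g) is its expected HWE frequency, so rare shared
    genotypes count more than common ones. NOT the authors' formula."""
    total = 0.0
    for gx, gy, q in zip(x, y, minor_allele_freqs):
        if gx == gy:
            total += 1.0 - hwe_frequencies(q)[gx]
    return total


if __name__ == "__main__":
    freqs = [0.1, 0.1]   # toy minor-allele frequencies at two loci
    a = [2, 0]           # rare homozygote at the first locus
    c = [0, 0]           # common homozygote at both loci
    print(linear_kernel(a, a), rarity_weighted_kernel(a, a, freqs))
    print(linear_kernel(c, c), rarity_weighted_kernel(c, c, freqs))
```

Because the weighted function is a sum of non-negative weights over matching genotypes, it equals a dot product of rescaled one-hot genotype encodings and is therefore a valid positive semi-definite kernel.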

Conclusion: The "Hardy-Weinberg kernel" reported here represents one of the first attempts at incorporating genetic knowledge into the definition of a kernel function designed for the analysis of genetic data. We show that the best performance of the "Hardy-Weinberg kernel" is observed when rare genotypes have different frequencies in cases and controls. The ability to capture the effect of rare genotypes on phenotypic traits might be a very important and useful feature, as most current statistical tools lose much of their statistical power when rare genotypes are involved in the susceptibility to the trait under study.

Figure 2: Mean AUC values over the five repeated cross-validations, obtained on the SCC dataset using an SVM classifier with either the HWk (black) or the linear kernel (red). The results refer to a threshold T for the univariate feature selection procedure of 0.20 (a), 0.15 (b) and 0.10 (c), respectively.

Mentions: The results on the SCC dataset confirm the characteristics of the HWk already pointed out in the experiments on simulated datasets. Figures 2a, 2b and 2c report the results for the SCC dataset with a threshold T for the univariate feature selection procedure of 0.20, 0.15 and 0.10, respectively. The results for each repetition of the cross-validation procedure are reported in Additional file 1, Tables S1a, S1b and S1c. We can observe that the HWk performs better for small values of C, while the two kernels show similar results for higher C values.
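The summary statistic reported in the figure, a mean AUC over repeated cross-validations, can be sketched as follows. This is a minimal illustration in plain Python, assuming binary labels (1 for cases, 0 for controls) and one classifier score per subject. The function names and the toy inputs are hypothetical and are not results from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of case/control
    pairs in which the case receives the higher score (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(repetitions):
    """Average the AUC over several (labels, scores) pairs, one pair per
    repetition of the cross-validation procedure."""
    values = [auc(labels, scores) for labels, scores in repetitions]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy inputs only, chosen to make the arithmetic easy to follow.
    toy = [([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
           ([1, 0], [0.8, 0.2])]
    print(mean_auc(toy))
```

In a comparison like the one in Figure 2, such a mean would be computed once per kernel and per value of C, which gives one point on each curve.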

