Model selection based on logistic regression in a highly correlated candidate gene region.

Uh HW, Mertens BJ, Jan van der Wijk H, Putter H, van Houwelingen HC, Houwing-Duistermaat JJ - BMC Proc (2007)

Bottom Line: The methods are illustrated using the simulated dense SNPs including the causal DR/C locus on chromosome 6. We also evaluate model selection in terms of average prediction error across nine replicates. We conclude that for the Genetic Analysis Workshop 15 (GAW15) data, the newly developed Bayesian selection method performs well.

Affiliation: Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands. h.uh@lumc.nl

ABSTRACT
Our aim is to develop methods for identifying a (causal) variant or variants from a dense panel of single-nucleotide polymorphisms (SNPs) that are genotyped on the evidence of previous studies. Because a large number of SNPs are in close proximity to each other, the magnitude of linkage disequilibrium (LD) plays an important role. Specifically, highly correlated SNPs may hamper standard methods such as multivariate logistic regression because of multicollinearity between the covariates. Sequences of high-dimensional models naturally raise questions about model selection strategies. We investigate three variable selection methods based on logistic regression. Penalties were imposed through stepwise selection using Akaike's Information Criterion (AIC) and through the lasso penalty. Finally, a Bayesian variable-selection logistic regression model was implemented. The methods are illustrated using the simulated dense SNPs including the causal DR/C locus on chromosome 6. We also evaluate model selection in terms of average prediction error across nine replicates. We conclude that for the Genetic Analysis Workshop 15 (GAW15) data, the newly developed Bayesian selection method performs well.
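
For orientation, the two penalized criteria named in the abstract can be written in their standard textbook form. The notation is an assumption made here, not taken from the article: \ell is the logistic log-likelihood, k the number of fitted parameters, p the number of candidate SNPs, and \lambda \ge 0 a tuning parameter. The intercept is conventionally left unpenalized.

```latex
\[
\mathrm{AIC} = -2\,\ell(\hat{\beta}) + 2k,
\qquad
\hat{\beta}_{\mathrm{lasso}}
  = \arg\max_{\beta}\Big\{ \ell(\beta) - \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \Big\}
\]
```

Stepwise selection compares candidate models by AIC, whereas the lasso shrinks coefficients and sets some exactly to zero, which is what makes it a selection method.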

Figure 1: Relative importance of the SNPs across all models simulated. Two SNPs, including the causal SNP 3437, were selected in nearly 98% of all models.

Mentions: Whereas the disease SNP 3437 was selected by all methods (Table 1), the Bayesian variable-selection logistic regression model was the most parsimonious, as it identified only two SNPs. In Figure 1, the number of times each SNP was selected into the model, across all models simulated, is expressed as a percentage of the total number of models considered. The two SNPs were selected in nearly 98% of all models. To summarize classification performance over all models visited, we calculated the mean posterior class probability and found sensitivity and specificity of 0.82, with a global misspecification error of 0.13. Figure 2 shows the ROC curve for the marginal mean posterior class probability; the area under the curve (AUC) equals 0.933.
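
The summaries described above (selection percentage per SNP, sensitivity, specificity, overall error rate, and AUC) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the 0.5 classification threshold, the SNP labels other than 3437, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def inclusion_percentages(models):
    """Percentage of visited models that include each SNP."""
    counts = Counter(snp for model in models for snp in set(model))
    return {snp: 100.0 * n / len(models) for snp, n in counts.items()}

def classification_summary(probs, labels, threshold=0.5):
    """Sensitivity, specificity and overall error rate at a threshold (assumed 0.5)."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "error": (fp + fn) / len(pairs),
    }

def auc(probs, labels):
    """AUC as the chance that a case scores above a control (ties count one half)."""
    cases = [p for p, y in zip(probs, labels) if y == 1]
    controls = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))

# Toy data: "snp_A" and "snp_B" are hypothetical labels, not SNPs from the article.
toy_models = [["3437", "snp_A"], ["3437", "snp_A"], ["3437"], ["3437", "snp_A", "snp_B"]]
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]   # mean posterior class probabilities
toy_labels = [1, 1, 1, 0, 0, 0]              # 1 = case, 0 = control

if __name__ == "__main__":
    print(inclusion_percentages(toy_models))
    print(classification_summary(toy_probs, toy_labels))
    print(auc(toy_probs, toy_labels))
```

The pairwise AUC used here equals the area under the empirical ROC curve; it is quadratic in the sample size, which is acceptable for a small illustration but not for large data sets.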

