Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia.

Morozova O, Levina O, Uusküla A, Heimer R - BMC Med Res Methodol (2015)

Bottom Line: Model selection with stepwise methods was highly unstable, even though most (and, for backward elimination with BIC, forward selection with BIC, and backward elimination with LRT, all) of the selected variables were significant (the 95 % confidence interval for the coefficient did not include zero). Adaptive elastic net demonstrated improved stability and more conservative estimates of coefficients and standard errors compared to the stepwise methods. In situations of high uncertainty it is beneficial to apply different methodologically sound subset selection methods and explore where their outputs do and do not agree.


Affiliation: Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA. olga.morozova@yale.edu.

ABSTRACT

Background: Automatic stepwise subset selection methods in linear regression often perform poorly, both in terms of variable selection and in terms of estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet, stepwise algorithms remain the dominant method in medical and epidemiological research.

Methods: The performance of stepwise methods (backward elimination and forward selection algorithms using AIC, BIC, and the Likelihood Ratio Test at p = 0.05 (LRT)) and of alternative subset selection methods in linear regression, including Bayesian model averaging (BMA) and penalized regression (lasso, adaptive lasso, and adaptive elastic net), was investigated in a dataset from a cross-sectional study of drug users conducted in St. Petersburg, Russia in 2012-2013. The dependent variable measured health-related quality of life, and the candidate independent variables comprised 44 measures of demographic, behavioral, and structural factors.
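
The paper's own code is not reproduced on this page. As a rough illustration of the penalized-regression arm of the comparison, the sketch below fits a cross-validated lasso or elastic net in scikit-learn and approximates the adaptive variants by the common column-rescaling trick (weights from an initial ridge fit). The function name fit_penalized, the tuning grid, and the simulated data are illustrative assumptions, not the authors' setup, and scikit-learn's CV picks the error-minimizing penalty (the λmin analogue); the λ1se rule used in the paper is not shown here.

```python
# Minimal sketch (not the authors' code): cross-validated lasso / elastic net,
# with adaptive penalties approximated by rescaling columns with ridge weights.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV, RidgeCV
from sklearn.preprocessing import StandardScaler

def fit_penalized(X, y, l1_ratio=1.0, adaptive=False, gamma=1.0, seed=0):
    """Return coefficients of a (possibly adaptive) lasso / elastic net fit.

    X : (n, p) array of candidate covariates, y : (n,) outcome.
    adaptive=True rescales each column by |beta_ridge|**gamma, so covariates
    with weak initial support receive a heavier effective penalty.
    """
    Xs = StandardScaler().fit_transform(X)
    w = np.ones(Xs.shape[1])
    if adaptive:
        ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(Xs, y)
        w = np.abs(ridge.coef_) ** gamma + 1e-8   # adaptive weights
    Xw = Xs * w                                   # column-rescaling trick
    if l1_ratio == 1.0:
        model = LassoCV(cv=10, random_state=seed).fit(Xw, y)
    else:
        model = ElasticNetCV(l1_ratio=l1_ratio, cv=10,
                             random_state=seed).fit(Xw, y)
    return model.coef_ * w                        # back to the standardized scale

# Toy usage on simulated data with 44 candidate covariates
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 44))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=300)
beta = fit_penalized(X, y, l1_ratio=0.5, adaptive=True)   # adaptive elastic net
print((np.abs(beta) > 0).sum(), "covariates selected")
```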

Results: In our case study, all methods returned models of different size and composition, with the number of selected variables ranging from 41 to 11. The percentage of significant variables among those selected in the final model varied from 100 % to 27 %. Model selection with stepwise methods was highly unstable, even though most (and, for backward elimination with BIC, forward selection with BIC, and backward elimination with LRT, all) of the selected variables were significant (the 95 % confidence interval for the coefficient did not include zero). Adaptive elastic net demonstrated improved stability and more conservative estimates of coefficients and standard errors than the stepwise methods. By incorporating model uncertainty into subset selection and into the estimation of coefficients and their standard deviations, BMA returned a parsimonious model with the most conservative results in terms of covariate significance.
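
To make the BMA idea referenced above concrete: BMA weights each candidate model by its (approximate) posterior probability and averages coefficients and inclusion indicators across models. The sketch below is a generic BIC-based approximation over an enumerable set of small subsets, not the package or prior structure the authors used; the function name bma_bic, the max_vars cap, and the toy data are illustrative assumptions (exhaustive enumeration of 44 covariates is infeasible, so real BMA implementations search or sample the model space).

```python
# Minimal sketch (not the authors' implementation): BIC-approximated Bayesian
# model averaging (BMA) over all subsets of at most max_vars covariates.
import itertools
import numpy as np

def bma_bic(X, y, max_vars=3):
    """Fit OLS for every subset of at most max_vars covariates, weight each
    model by exp(-BIC/2) (a standard approximation to its posterior
    probability), and return model-averaged coefficients together with
    posterior inclusion probabilities."""
    n, p = X.shape
    bics, betas, incl = [], [], []
    for k in range(max_vars + 1):
        for subset in itertools.combinations(range(p), k):
            Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = float(np.sum((y - Z @ coef) ** 2))
            bics.append(n * np.log(rss / n) + Z.shape[1] * np.log(n))
            beta = np.zeros(p)
            beta[list(subset)] = coef[1:]
            betas.append(beta)
            member = np.zeros(p, dtype=bool)
            member[list(subset)] = True
            incl.append(member)
    w = np.exp(-0.5 * (np.array(bics) - min(bics)))
    w /= w.sum()
    return w @ np.array(betas), w @ np.array(incl)

# Toy usage: 8 candidate covariates, 2 of which truly matter
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)
beta_avg, pip = bma_bic(X, y)
print(np.round(pip, 2))   # posterior inclusion probabilities per covariate
```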

Conclusions: BMA and adaptive elastic net performed best in our analysis. Based on our results and on previous theoretical studies, stepwise methods in medical and epidemiological research may be outperformed by alternative methods in cases such as ours. In situations of high uncertainty it is beneficial to apply different methodologically sound subset selection methods and explore where their outputs do and do not agree. We recommend that researchers, at a minimum, explore model uncertainty and stability as part of their analyses and report these details in epidemiological papers.


Fig. 2: Bootstrap frequency of covariate selection in the final model using penalized regression. The dependent variable is the EuroQoL 5D visual analogue scale measure of health-related quality of life. Panel a shows lasso with λmin and panel b lasso with λ1se; panels c and d show adaptive lasso with λmin and λ1se, respectively; panels e and f show adaptive elastic net with λmin and λ1se, respectively. Black bars represent variables selected in the final model, and light grey bars represent variables excluded from the final model. The solid line and the number next to it mark the minimum inclusion frequency among variables included in the final model; the dashed line and the number next to it mark the maximum inclusion frequency among variables excluded from the final subset. The dotted line marks a frequency of 0.9, and the number next to it gives the percentage of variables in the final model with an inclusion frequency above 0.9 (out of the number of variables selected in the final model). Descriptions of the variable names are provided in Additional file 2.

Mentions: Detailed outputs of the lasso, adaptive lasso, and adaptive elastic net regressions are presented in Additional file 6 and include regularization paths, graphs of the cross-validation results, estimates of the regression coefficients along with 95 % CIs, and the bootstrap inclusion frequencies of the variables. For lasso, the highest inclusion frequency among non-selected variables was greater than the lowest inclusion frequency among selected variables (both for λmin and λ1se). For adaptive lasso and adaptive elastic net, by contrast, the inclusion frequencies of selected variables were greater than those of non-selected variables in all cases except adaptive elastic net with λ1se, where the corresponding frequencies were equal (Fig. 2). Adaptive lasso demonstrated better stability in terms of the gap between the lowest inclusion frequency among selected variables and the highest among non-selected variables. On the other hand, lasso and adaptive elastic net performed better in terms of the percentage of model variables with an inclusion frequency above 0.9 (Fig. 2). The number of significant variables was generally smaller for all penalized regression methods than for the stepwise methods (Fig. 4).
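
The bootstrap inclusion frequencies summarized in Fig. 2 can be computed in outline as follows. This is a generic sketch, not the authors' exact procedure or software: it refits scikit-learn's LassoCV on each bootstrap resample (which selects the error-minimizing penalty, i.e. the λmin analogue; the λ1se rule would need to be implemented separately), and the simulated data and function name bootstrap_inclusion are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): bootstrap inclusion frequencies for
# a cross-validated lasso, the quantity summarized in Fig. 2.
import numpy as np
from sklearn.linear_model import LassoCV

def bootstrap_inclusion(X, y, n_boot=100, seed=0):
    """Refit LassoCV on bootstrap resamples of the rows and return, for each
    covariate, the fraction of resamples in which its coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        fit = LassoCV(cv=5).fit(X[idx], y[idx])
        counts += fit.coef_ != 0
    return counts / n_boot

# Toy usage: compare stability of selected vs. excluded covariates
rng = np.random.default_rng(3)
X = rng.normal(size=(250, 44))
y = X[:, :4] @ np.array([1.0, -0.8, 0.6, 0.5]) + rng.normal(size=250)
freq = bootstrap_inclusion(X, y)
selected = LassoCV(cv=5).fit(X, y).coef_ != 0
print("lowest frequency among selected variables:", freq[selected].min())
if (~selected).any():
    print("highest frequency among excluded variables:", freq[~selected].max())
```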

