Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study.

Marshall A, Altman DG, Royston P, Holder RL - BMC Med Res Methodol (2010)

Bottom Line: A simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model. SI underestimated the variability, resulting in poor coverage even with 10% missingness. The results suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.


Affiliation: Centre for Statistics in Medicine, University of Oxford, Oxford, UK. andrea.marshall@warwick.ac.uk

ABSTRACT

Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five proportions of incomplete cases, ranging from 5% to 75%, were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
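The three missingness mechanisms named in the Methods can be illustrated with a minimal Python sketch (the study itself used R). Everything here is an assumption made for illustration, not the authors' procedure: the function name impose_missingness, the use of a single incomplete covariate x1 and a single fully observed covariate x2, and the rule that cases above the median of the driving variable are more likely to be deleted.

```python
import random
import statistics


def impose_missingness(x1, x2, mechanism, rate, seed=0):
    """Return a copy of x1 with some values replaced by None.

    Hypothetical illustration:
      MCAR: deletion probability is the same for every case.
      MAR:  deletion probability depends on the fully observed x2.
      MNAR: deletion probability depends on x1 itself.
    """
    rng = random.Random(seed)
    if mechanism == "MCAR":
        driver = None
    elif mechanism == "MAR":
        driver = x2
    elif mechanism == "MNAR":
        driver = x1
    else:
        raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")

    # Cases above the median of the driver are deleted with probability
    # p_high, the rest with p_low; the two average to the target rate.
    p_high = min(1.0, 1.5 * rate)
    p_low = 2.0 * rate - p_high
    cut = statistics.median(driver) if driver is not None else None

    result = []
    for i, value in enumerate(x1):
        if driver is None:
            p = rate
        else:
            p = p_high if driver[i] > cut else p_low
        result.append(None if rng.random() < p else value)
    return result
```

Under MNAR in this sketch the large values of x1 are preferentially removed, so the observed values are no longer representative of the full distribution, which is one way to see why no method that uses only the observed data can fully correct for it.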

Results: Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
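The abstract does not state how the MI estimates were combined. The standard approach is Rubin's rules, which are assumed in the sketch below. It shows why SI tends to understate variability: a single imputed dataset contributes only the within-imputation variance, whereas MI adds a between-imputation term. The function name pool_rubin and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors.

    Returns (pooled_estimate, pooled_standard_error) using Rubin's rules:
      total variance = within + (1 + 1/m) * between
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = sum(estimates) / m
    within = sum(variances) / m                # average squared standard error
    between = statistics.variance(estimates)   # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficient estimates from three imputed datasets.
pooled_estimate, pooled_se = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
```

In this example each imputed dataset alone would report a standard error of 0.2, while the pooled standard error is larger because the estimates disagree across imputations.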

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
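The idea behind predictive mean matching can be sketched for a single incomplete variable as follows. This is a simplified illustration and not the MICE-PMM procedure evaluated in the study: it uses one predictor, a plain least-squares fit, and no draw of the regression parameters, and it produces one imputed dataset rather than several. The names fit_line and pmm_impute and the donor pool size k are assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(y, x, k=5, seed=0):
    """Fill None entries of y by borrowing observed values from similar cases.

    For each missing case, find the k observed cases whose predicted
    means are closest to its own prediction and copy the observed y of
    one of them, chosen at random.
    """
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([o[0] for o in observed],
                                [o[1] for o in observed])
    donors = [(intercept + slope * xi, yi) for xi, yi in observed]

    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        target = intercept + slope * xi
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        filled.append(rng.choice(nearest)[1])
    return filled
```

Because every imputed value is an actually observed value, imputations stay within the observed range of the covariate, which is one commonly cited reason the approach is considered for skewed continuous covariates.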



Figure 4: Coverage of the regression coefficient estimates for different missing data methods for increasing percentage of MAR missingness.

Mentions: Figure 4 shows the coverage of the regression coefficient estimates for the different missing data methods in relation to an increasing percentage of MAR missingness. The coverage of the true value within the confidence limits constructed after performing a CC analysis remained around the nominal 95% level for all amounts of missingness (Figure 4). The underestimated SE with SI resulted in much shorter confidence intervals than with the other missing data methods and poorer coverage especially for the incomplete covariates and also X4. Even with 10% missingness, the coverage for the regression coefficient estimates associated with X2 and X3 was around 90% using SI.
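Coverage, as used above, is the proportion of simulated datasets whose confidence interval contains the true value. The sketch below computes it for a deliberately simple stand-in, the mean of a normal sample rather than a Cox regression coefficient, and then shrinks the standard errors by an arbitrary factor of 0.7 to mimic an analysis that understates variability. The function names, the toy model and the shrink factor are all assumptions for illustration.

```python
import math
import random
import statistics


def coverage(estimates, std_errors, true_value, z=1.96):
    """Share of intervals estimate +/- z * SE that contain true_value."""
    hits = 0
    for est, se in zip(estimates, std_errors):
        if est - z * se <= true_value <= est + z * se:
            hits += 1
    return hits / len(estimates)


def simulate_means(n_reps=2000, n=50, true_mean=0.0, seed=1):
    """Simulate n_reps samples and return their means and standard errors."""
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(n_reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        estimates.append(sum(sample) / n)
        std_errors.append(statistics.stdev(sample) / math.sqrt(n))
    return estimates, std_errors


estimates, std_errors = simulate_means()
nominal = coverage(estimates, std_errors, true_value=0.0)
# Understated standard errors give shorter intervals and lower coverage.
understated = coverage(estimates, [0.7 * se for se in std_errors], true_value=0.0)
```

With correct standard errors the coverage should sit near the nominal 95% level, and with the shrunken standard errors it falls below that, which is the same pattern the text describes for SI.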

