A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization.

Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL - BMC Med Res Methodol (2015)

Bottom Line: Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

Affiliation: Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany. hornung@ibe.med.uni-muenchen.de.

ABSTRACT

Background: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test set based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values.

Methods: We devise the easily interpretable and general measure CVIIM ("CV Incompleteness Impact Measure") to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.

Results: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

Conclusions: While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
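
The difference between incomplete and full CV described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: per-feature centering stands in for the data preparation step and a nearest-centroid rule stands in for the classifier; neither is the normalization method, PCA procedure or classifier used in the study, and all function names are hypothetical.

```python
import random


def fit_centering(rows):
    """Stand-in preparation step (assumption): learn per-feature means."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def apply_centering(means, rows):
    """Apply previously learned means to any set of observations."""
    return [[x - m for x, m in zip(row, means)] for row in rows]


def fit_centroids(rows, labels):
    """Stand-in classifier (assumption): one mean vector per class."""
    centroids = {}
    for c in set(labels):
        members = [r for r, l in zip(rows, labels) if l == c]
        centroids[c] = fit_centering(members)  # reuse the mean computation
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((x - m) ** 2 for x, m in zip(row, centroids[c]))
    return min(sorted(centroids), key=dist)


def cv_error(X, y, K, complete, seed=0):
    """K-fold CV misclassification rate, with full or incomplete CV."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    if not complete:
        # Incomplete CV: the preparation step sees all observations,
        # including those that later serve as test data.
        X = apply_centering(fit_centering(X), X)
    errors = 0
    for test in folds:
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        X_tr = [X[i] for i in train]
        X_te = [X[i] for i in test]
        if complete:
            # Full CV: the preparation step is learned on the training
            # folds only and then applied to the held-out fold.
            params = fit_centering(X_tr)
            X_tr = apply_centering(params, X_tr)
            X_te = apply_centering(params, X_te)
        model = fit_centroids(X_tr, [y[i] for i in train])
        errors += sum(predict(model, r) != y[i] for r, i in zip(X_te, test))
    return errors / len(X)
```

Calling `cv_error(X, y, K, complete=True)` and `cv_error(X, y, K, complete=False)` on the same data gives the two error estimates whose relation the paper studies. For a pure shift such as centering, the two estimates should essentially coincide, because the distances used by the classifier do not change under a common shift; the point of the sketch is only where the preparation step is fitted.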


Fig. 4: Dependency on CV errors in PCA study. Upper panel: CVIIM_{s,n,K} values versus e_{full,K}(s) values for all settings. Lower panel: zero-truncated differences between e_{full,K}(s) and e_{incompl,K}(s) values versus e_{full,K}(s) values for all settings. The colors and numbers distinguish the different datasets. The filled black circles depict the respective means over the results of all settings obtained on the specific datasets.

Mentions: An important question with respect to the definition of CVIIM is whether it depends on E[e_{full,K}(S)]. Such a dependence is not desirable, since CVIIM should measure the impact of CV incompleteness, not the error itself. To investigate this in the context of the PCA study, we plot CVIIM_{s,n,K} against e_{full,K}(s) in the upper panel of Fig. 4, where the different analysis settings for a given dataset are represented using the same colour and number, and the mean of each dataset is displayed as a black point. This plot suggests no relevant dependency of CVIIM_{s,n,K} on the full CV error e_{full,K}(s). For two of the smallest errors we observe extreme CVIIM estimates, resulting from random fluctuations in the error estimates as discussed in Appendix B.3 (Additional file 2). However, this problem, which concerns only two values out of 480 error values in total, seems to be negligible. The lower panel of Fig. 4 displays the zero-truncated difference between e_{full,K}(s) and e_{incompl,K}(s) against e_{full,K}(s). This plot clearly suggests a comparatively strong dependence of the estimates of this measure on the full CV error, as also observed in the simulation study presented in Appendix A (Additional file 2), and thus provides evidence supporting the use of a ratio-based measure rather than a difference-based measure. Analogous plots give a very similar picture in the case of normalization; see Figure S6 in Appendix C (Additional file 2).
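
The contrast between a ratio-based and a difference-based measure can be made concrete with a small Python sketch. It assumes that the CVIIM estimate takes the form 1 - e_incompl / e_full, set to zero when the incomplete CV error is not smaller than the full CV error or when the full CV error is zero; the exact definition is given in the article and is not reproduced in this excerpt. The function names and the numbers are hypothetical and are not results from the study.

```python
def cviim(e_full, e_incompl):
    """Ratio-based estimate (assumed form): relative reduction of the
    error estimate caused by incomplete CV, truncated at zero."""
    if e_full > 0 and e_incompl < e_full:
        return 1.0 - e_incompl / e_full
    return 0.0


def truncated_difference(e_full, e_incompl):
    """Difference-based alternative: max(e_full - e_incompl, 0)."""
    return max(e_full - e_incompl, 0.0)


# Two made-up settings in which incomplete CV underestimates the error
# by the same relative amount, but at different full CV error levels.
settings = [(0.10, 0.08), (0.40, 0.32)]
for e_full, e_incompl in settings:
    print(
        e_full,
        round(cviim(e_full, e_incompl), 3),
        round(truncated_difference(e_full, e_incompl), 3),
    )
```

In these made-up settings the ratio-based value is the same for both error levels, while the truncated difference grows with the full CV error. This is the kind of dependence on e_{full,K}(s) that the lower panel of Fig. 4 is reported to show for the difference-based measure.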

