Stratification bias in low signal microarray studies.

Parker BJ, Günter S, Bedo J - BMC Bioinformatics (2007)

Bottom Line: In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.


Affiliation: Statistical Machine Learning Group, NICTA, Canberra, Australia. brian.bj.parker@gmail.com

ABSTRACT

Background: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.
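The negative correlation described above can be seen directly: in unstratified k-fold cross-validation each training set is the complement of its test fold, so any surplus of positives in the test fold is exactly a deficit of positives in the training set. A minimal pure-Python simulation (illustrative helper names, not from the paper) makes this concrete:

```python
import random

def fold_proportions(n_pos=25, n_neg=25, k=10, reps=200, seed=0):
    """For unstratified k-fold CV, record the positive-class fraction
    of each test fold and of its complementary training set."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    test_fracs, train_fracs = [], []
    for _ in range(reps):
        rng.shuffle(labels)
        size = len(labels) // k
        for i in range(k):
            test = labels[i * size:(i + 1) * size]
            train = labels[:i * size] + labels[(i + 1) * size:]
            test_fracs.append(sum(test) / len(test))
            train_fracs.append(sum(train) / len(train))
    return test_fracs, train_fracs

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

test_f, train_f = fold_proportions()
# Because the training set is the exact complement of the test fold, the
# two proportions are perfectly negatively related: correlation -1.
print(round(pearson(test_f, train_f), 4))  # -1.0
```

With complete (non-overlapping, exhaustive) folds the relation is deterministic, so the correlation is exactly -1; with repeated holdout or bootstrap resampling it would be negative but not perfect.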

Results: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice.
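The mechanism behind such sub-0.5 AUCs can be reproduced with a deliberately signal-free toy model (not one of the classifiers used in the paper): labels are random, and the "classifier" outputs its training set's positive-class fraction as the score for every test sample, a caricature of the training-prior dependence that drives the bias. Pooling test scores across unstratified folds then systematically ranks positives below negatives, while per-fold averaging stays at 0.5:

```python
import random

def auc(pairs):
    """Mann-Whitney AUC over (score, label) pairs; ties count 0.5."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    if not pos or not neg:
        return None  # undefined when a fold lacks one class
    wins = sum(1.0 if p > q else (0.5 if p == q else 0.0)
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cv_auc_random(n_pos=25, n_neg=25, k=10, reps=300, seed=1):
    """Unstratified k-fold CV on pure-noise labels: every test sample in a
    fold is scored with that fold's training-set positive fraction."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    pooled_sum = avg_sum = 0.0
    for _ in range(reps):
        rng.shuffle(labels)
        size = len(labels) // k
        pooled, per_fold = [], []
        for i in range(k):
            test = labels[i * size:(i + 1) * size]
            train = labels[:i * size] + labels[(i + 1) * size:]
            score = sum(train) / len(train)  # no signal: score = train prior
            fold_pairs = [(score, y) for y in test]
            pooled.extend(fold_pairs)
            a = auc(fold_pairs)  # 0.5 whenever both classes appear in the fold
            if a is not None:
                per_fold.append(a)
        pooled_sum += auc(pooled)
        avg_sum += sum(per_fold) / len(per_fold)
    return pooled_sum / reps, avg_sum / reps

pooled_mean, avg_mean = cv_auc_random()
# pooled_mean falls well below 0.5; avg_mean stays at 0.5
print(round(pooled_mean, 3), round(avg_mean, 3))
```

Folds with more test positives necessarily have fewer training positives, hence lower scores, so pooling ranks positives low overall; within a single fold all scores are tied, so the per-fold estimate is unbiased at 0.5.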

Conclusion: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
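The exact construction of balanced leave-one-out cross-validation is given in the paper's Methods; purely as an illustration, the sketch below assumes the simplest variant consistent with the name (an assumption, not necessarily the authors' exact scheme): when one sample is held out, one randomly chosen sample of the opposite class is also dropped from the training set, so every fold trains on identical class counts.

```python
import random

def balanced_loocv_folds(labels, seed=0):
    """Sketch of one plausible 'balanced LOOCV' variant: when sample i is
    held out, a randomly chosen opposite-class sample is also dropped from
    the training set, keeping class proportions constant across folds."""
    rng = random.Random(seed)
    folds = []
    for i, yi in enumerate(labels):
        train = [j for j in range(len(labels)) if j != i]
        opposite = [j for j in train if labels[j] != yi]
        train.remove(rng.choice(opposite))  # balance the training set
        folds.append((train, [i]))
    return folds

labels = [1] * 6 + [0] * 4
for train, _test in balanced_loocv_folds(labels):
    counts = (sum(labels[j] for j in train),
              sum(1 - labels[j] for j in train))
    assert counts == (5, 3)  # identical class counts in every fold
```

Whichever sample is held out, the training set always contains 5 positives and 3 negatives, so the training class prior no longer co-varies with the held-out sample's class.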


Figure 10: Standard deviations of AUC estimates for van 't Veer dataset using linear SVM.

Mentions: Figure 9 shows the results of AUC estimation for the van 't Veer breast cancer dataset using SVM classification (with linear kernel), comparing both the pooling and averaging strategies. Figure 9(a) shows the results for cross-validation; 9(b) shows the results for LOOCV and bootstrap. The results for AUC calculation using a pooling strategy show that there is a substantial systematic pessimistic bias for the unstratified versions of cross-validation and bootstrap compared with the stratified versions. Also, stratified CV still shows some downward bias compared with balanced, stratified CV. LOOCV, which can only be used with a pooling strategy, shows substantial biases unless the balanced version, balanced LOOCV, is used. The performance of the balanced LOOCV is superior to the other pooled methods at small sample sizes, as the training set of each fold is as large as possible. AUCs computed using the averaging strategy, by contrast, are not shifted relative to each other, whether stratified or unstratified validation is used, and all reach the same asymptotic value, indicating that they do not suffer from the stratification bias. The results on the full dataset for the averaged AUC estimates are slightly higher than those for the pooling methods, including the stratified and balanced versions. This is due to attenuation caused by non-systematic classifier differences across folds, as described previously. Figure 10(a) and 10(b) show the standard deviations for figures 9(a) and 9(b), respectively. Note that, with the averaging method, the variance of stratified CV and balanced, stratified CV is lower than that of unstratified CV, suggesting that such stratified validation schemes can give a worthwhile improvement in variance when used with averaged AUC estimation. Repeated cross-validation would lower the variance further, as discussed in the simulation section.
The Hanley-McNeil estimate for a sample size of 50 is 0.07, which approximately matches the empirical standard deviations at this sample size. The stratification bias at this sample size for linear SVM (C = 0.01) is substantially less than the SE in this case.
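For reference, the Hanley-McNeil standard error of an empirical AUC value A with n_pos positive and n_neg negative samples is SE = sqrt([A(1-A) + (n_pos-1)(Q1-A^2) + (n_neg-1)(Q2-A^2)] / (n_pos*n_neg)), where Q1 = A/(2-A) and Q2 = 2A^2/(1+A). The quoted 0.07 at sample size 50 is consistent with an AUC around 0.7; the 25/25 split in the example below is an assumption, since the exact class split at that sample size is not restated here:

```python
def hanley_mcneil_se(a, n_pos, n_neg):
    """Hanley-McNeil (1982) standard error of an empirical AUC value a."""
    q1 = a / (2 - a)           # P(two random positives both outrank a negative)
    q2 = 2 * a * a / (1 + a)   # P(one positive outranks two random negatives)
    var = (a * (1 - a)
           + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    return var ** 0.5

print(round(hanley_mcneil_se(0.5, 25, 25), 3))  # 0.082 (null AUC)
print(round(hanley_mcneil_se(0.7, 25, 25), 3))  # 0.074
```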

