Stratification bias in low signal microarray studies.
Bottom Line:
In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Therefore, for model selection and evaluation of microarray and other small biological datasets, stratified and balanced validation methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
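The stratified fold assignment these recommendations rely on can be sketched in a few lines. The following is an illustrative pure-Python version (not the authors' implementation): samples of each class are dealt round-robin into the folds, so every fold mirrors the overall class proportions to within one sample.

```python
import random

def stratified_folds(labels, k, rng=random):
    """Assign sample indices to k folds so that each fold mirrors the
    overall class proportions as closely as possible (within one sample)."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)                 # random order within each class
        for j, i in enumerate(members):
            folds[j % k].append(i)           # deal class members round-robin
    return folds

random.seed(1)
labels = [1] * 20 + [0] * 30                 # 40% positives overall
for fold in stratified_folds(labels, 10):
    print(sum(labels[i] for i in fold), len(fold))   # 2 positives in every fold of 5
```

With 20 positives and 30 negatives split into 10 folds, every fold receives exactly 2 positives and 3 negatives, eliminating the train/test class-proportion fluctuations that drive stratification bias.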
Affiliation: Statistical Machine Learning Group, NICTA, Canberra, Australia. brian.bj.parker@gmail.com
ABSTRACT
Background: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.
Results: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but it can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show that these biases exist in practice.
Conclusion: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout, a modified version of k-fold cross-validation (balanced, stratified cross-validation), and balanced leave-one-out cross-validation avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
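The pooling bias described in the abstract can be reproduced with a toy simulation. The sketch below (illustrative only, not the paper's code) scores every test sample with a deliberately uninformative statistic, the training set's positive-class proportion, so on random labels any departure from AUC 0.5 is pure stratification bias. Pooling the folds' test scores yields a pessimistic AUC, while averaging per-fold AUC estimates stays at chance level.

```python
import random

def auc(scores, labels):
    # Mann-Whitney form of the AUC: probability that a randomly chosen
    # positive outscores a randomly chosen negative (ties count half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cv_auc(labels, k=10, rng=random):
    """Unstratified k-fold CV with a trivial 'classifier' whose score is
    the training-set positive proportion.  Returns (pooled, per-fold avg)."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = [0.0] * len(labels)
    per_fold = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        p_pos = sum(labels[i] for i in train) / len(train)
        for i in fold:
            scores[i] = p_pos                # same score for the whole fold
        fold_y = [labels[i] for i in fold]
        if 0 < sum(fold_y) < len(fold_y):    # per-fold AUC needs both classes
            per_fold.append(auc([scores[i] for i in fold], fold_y))
    return auc(scores, labels), sum(per_fold) / len(per_fold)

random.seed(0)
runs = [cv_auc([random.randint(0, 1) for _ in range(50)]) for _ in range(300)]
pooled = sum(r[0] for r in runs) / len(runs)   # well below the chance level 0.5
avged = sum(r[1] for r in runs) / len(runs)    # exactly 0.5 (all within-fold ties)
print(pooled, avged)
```

The mechanism is visible in the code: a test fold that happens to hold more positives leaves fewer positives in training, so its (identical) scores are lower; once folds are pooled, positives therefore rank systematically below negatives.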
Figure 11(b) shows the ROC curves corresponding to Figure 11(a) for 10-fold CV and a sample size of 50. The ROC curves were produced by pooling the samples of the folds of the cross-validation and computing a combined ROC curve. To generate confidence bounds, this was repeated 100 times and the vertically averaged curve displayed with standard error bars [29]. For CV, the randomised dataset produces worse-than-random ROC curves with the pooling strategy, showing that the pooling strategy should also be avoided for ROC curve generation. With CV, the RBF kernel SVM shows large stratification biases, and the linear SVM using decision values shows only small biases at this sample size; BSCV applied to the linear SVM shows no substantial bias. Also shown are the results of the version of the SVM returning posterior probabilities which, as expected, suffers more from stratification bias than the version using decision values. The final experiments investigated the effect of stratification bias on error rate. As above, a randomised version of the van 't Veer dataset was used. Figure 12 shows the results for both SVM kernels. BSCV is not biased and is at the expected 0.5 error rate; CV is the most pessimistically biased, and stratified CV removes most of the bias, also reaching the expected 0.5 level except for small sample sizes. The biases for the RBF kernel are especially substantial and larger than for the linear SVM; note that stratified cross-validation still suffers from substantial bias in this case. Such large stratification biases of error rate could have an impact on model selection and evaluation, as the extent of the bias depends on the classifier type.
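The error-rate effect has a well-known extreme case that is easy to reproduce: on an exactly balanced dataset, unbalanced leave-one-out CV removes one sample and thereby makes the opposite class the training majority, so a majority-class predictor is wrong on every held-out sample. A minimal illustration (not the paper's code; a majority-class predictor stands in for any classifier that tracks the training class proportions):

```python
def loocv_majority_error(labels):
    """Leave-one-out CV error of a majority-class predictor (0/1 labels)."""
    errors = 0
    for i, y in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        pred = 1 if 2 * sum(train) > len(train) else 0  # training majority class
        errors += (pred != y)
    return errors / len(labels)

labels = [0] * 25 + [1] * 25        # perfectly balanced, no signal at all
print(loocv_majority_error(labels))  # -> 1.0 (100% error instead of the true 50%)
```

Holding out a 0 leaves 24 zeros versus 25 ones, so the predictor says 1; holding out a 1 leaves 25 zeros versus 24 ones, so it says 0. Every prediction is wrong, giving an estimated error of 100% where the true error of any classifier on signal-free data is 50%.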