Stratification bias in low signal microarray studies.
Bottom Line:
In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Therefore, for model selection and evaluation of microarray and other small biological datasets, stratified repeated holdout, balanced, stratified cross-validation or balanced leave-one-out cross-validation should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
Affiliation: Statistical Machine Learning Group, NICTA, Canberra, Australia. brian.bj.parker@gmail.com
ABSTRACT
Background: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.
Results: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice.
Conclusion: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
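The difference between pooling and per-fold averaging can be illustrated with a minimal sketch. It assumes an idealised classifier that outputs only the positive-class fraction of its training set (a stand-in for the majority-voter limit, with no features at all), and a hand-chosen, deliberately imperfect fold assignment. The names `auc` and `prior_scores` are illustrative and are not taken from the paper.

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: P(pos score > neg score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prior_scores(labels, folds):
    """Score every test sample with the positive-class fraction of its training set.

    Assumption: this idealised "classifier" has learned nothing but the class
    prior, so any deviation of the AUC from 0.5 is purely stratification bias.
    """
    n, n_pos = len(labels), sum(labels)
    scores = [0.0] * n
    for test_idx in folds:
        test_pos = sum(labels[i] for i in test_idx)
        train_frac = (n_pos - test_pos) / (n - len(test_idx))
        for i in test_idx:
            scores[i] = train_frac
    return scores


# 20 samples, 10 per class, split into 4 unstratified folds of 5 samples
# that contain 4, 3, 2 and 1 positives respectively (hand-chosen example).
labels = [1, 1, 1, 1, 0,  1, 1, 1, 0, 0,  1, 1, 0, 0, 0,  1, 0, 0, 0, 0]
folds = [list(range(start, start + 5)) for start in range(0, 20, 5)]

scores = prior_scores(labels, folds)

# Strategy 1: pool all test-set scores, then compute a single AUC.
pooled_auc = auc(scores, labels)

# Strategy 2: compute one AUC per fold, then average.
per_fold = [auc([scores[i] for i in f], [labels[i] for i in f]) for f in folds]
mean_fold_auc = sum(per_fold) / len(per_fold)

# Unbalanced leave-one-out CV is the extreme case: every sample is its own fold,
# so only the pooled AUC can be computed.
loo_folds = [[i] for i in range(len(labels))]
loo_pooled_auc = auc(prior_scores(labels, loo_folds), labels)

print(pooled_auc, mean_fold_auc, loo_pooled_auc)  # 0.25 0.5 0.0
```

In this toy setting a fold that happens to hold many positives leaves few positives for training, so its test samples receive a low score. Pooling mixes these fold-dependent scores and pushes the AUC below 0.5, whereas within a single fold all scores are tied and the per-fold AUC stays at 0.5.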
PubMed Central ID: PMC2211509
Mentions: The next series of experiments used a randomised version of the van 't Veer dataset. Figure 11(a) shows the results for a linear kernel SVM using the pooling strategy for AUC estimation. Balanced, stratified CV and balanced LOOCV are approximately at the expected 0.5 line. Unstratified CV and LOOCV are pessimistically biased. Stratified CV shows some small remaining stratification bias for small sample sizes; Figure 4 showed that stratified CV has some remaining covariance between training and test set sizes, and so this bias is expected.
Note that stratified bootstrap also shows some remaining stratification bias for very small sample sizes. Although in theory stratified bootstrap has no covariance between training and test set sizes, the training set sizes are made constant by sample replication. Presumably a duplicate sample has a weaker effect on the training of the classifier than a truly independent pattern, and so the "effective" training set size of a class is less than the sample size would suggest (for example, in 1-nearest neighbour classification, duplicates in the training set have no impact on the classification at all).
Results are also presented for an SVM with a radial basis function (RBF) kernel, also known as a Gaussian kernel, with parameter values C = 1 and σ² = 0.5. Only 100 random permutations were done with this SVM type because of the computational cost, which is nonetheless sufficient to demonstrate the bias. It is known that for such kernels, which increase the effective dimensionality of the data, the SVM will underfit and tend to a majority voter in large areas of the parameter space [28], and so rely maximally upon the prior proportion information; it would therefore be expected to perform poorly and suffer maximally from stratification bias. Indeed, Figure 11(a) demonstrates very large pessimistic biases.
The Hanley-McNeil SE estimate for sample size 50 is 0.08, so the stratification bias for the RBF SVM exceeds the SE. Some care would be needed to avoid confusing such a large negative deviation of the AUC from 0.5 with a genuine signal. Note that stratified CV still shows substantial biases.
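The quoted standard error can be reproduced with the Hanley-McNeil approximation. The sketch below assumes an even 25/25 class split for the 50 samples, which the text does not state, and a null AUC of 0.5; the function name `hanley_mcneil_se` is illustrative.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximate standard error of an AUC estimate."""
    q1 = auc / (2.0 - auc)              # P(two random positives both outrank one negative)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)   # P(one random positive outranks two negatives)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# Assumption: 50 samples split evenly between the two classes, no real signal.
se = hanley_mcneil_se(0.5, 25, 25)
print(round(se, 3))  # 0.082
```

Under these assumptions the result rounds to the 0.08 quoted above. A moderately uneven split of the 50 samples changes the value only slightly, since at an AUC of 0.5 the two class-size terms contribute equally.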