Resampling methods improve the predictive power of modeling in class-imbalanced datasets.

Lee PH - Int J Environ Res Public Health (2014)

Bottom Line: In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded a better classification power than that on the training data (AUC = 0.65).

View Article: PubMed Central - PubMed

Affiliation: School of Nursing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong. paul.h.lee@polyu.edu.hk.

ABSTRACT
In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling, can be used. This paper aims at illustrating the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009-2010 dataset. A total of 4677 participants aged ≥ 20 years without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model on undiagnosed diabetes. A participant was classified as having undiagnosed diabetes if he or she demonstrated evidence of diabetes according to the WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of other extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded a better classification power than that on the training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees.
To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.
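The oversampling and undersampling steps described above can be sketched in plain Python (a minimal illustration, not the paper's R code; the `resample` helper and the toy data are hypothetical):

```python
import random

def resample(cases, controls, ratio, method, seed=0):
    """Rebalance a class-imbalanced dataset toward a target case-to-control
    ratio of 1:ratio (e.g. 1:1, 1:2, 1:4), as in the resampling experiments.

    method="over":  draw cases with replacement until the ratio is met.
    method="under": keep a random subset of controls until the ratio is met.
    """
    rng = random.Random(seed)
    if method == "over":
        n_cases = len(controls) // ratio      # cases needed for 1:ratio
        cases = [rng.choice(cases) for _ in range(n_cases)]
    elif method == "under":
        n_controls = len(cases) * ratio       # controls kept for 1:ratio
        controls = rng.sample(controls, n_controls)
    return cases + controls

# Toy imbalanced data: 10 cases, 100 controls.
cases = [("case", i) for i in range(10)]
controls = [("control", i) for i in range(100)]

over = resample(cases, controls, ratio=1, method="over")    # 1:1 by oversampling
under = resample(cases, controls, ratio=4, method="under")  # 1:4 by undersampling
```

The rebalanced dataset would then be passed to the classifier in place of the original training data.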


ijerph-11-09776-f008: Decision tree model fitted using the undersampled (case-to-control ratio = 1:4) training dataset (numbers in bold/italic/underline: sample size of the node, percentage of undiagnosed diabetes in the training dataset, and percentage of undiagnosed diabetes in the testing dataset, respectively).

Mentions: The Classification and Regression Tree (CART) procedure [11] was used to build a classification model on undiagnosed diabetes. CART is a recursive partitioning procedure that aims at splitting the data into distinct partitions based on the most important exposure variables determined by the procedure. A split on a partition is carried out to maximize the purity, that is, the dominance of one class, of its descendant partitions. In this study, the purity of a partition is measured by the Gini impurity, which equals 1 − p1² − p2², where p1 and p2 are the proportions of classes 1 and 2, respectively. The model is called a tree model because the partitions can be arranged in a tree-like structure, as shown in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8. The CART was fitted using the rpart package of R, with a complexity parameter of 0.01 and a minimum partition size of 20. We also fitted a CART model with the complexity parameter determined by the 1-SE rule (a standard, accepted method for complexity parameter determination) [9], a random forest model using the randomForest package of R with 500 trees, and generalized boosted trees using the gbm package of R with 100 trees. Random forests and generalized boosted trees are extensions of CART that construct multiple decision trees to improve prediction accuracy.
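For two classes, the Gini impurity defined above can be computed directly; a minimal Python sketch (the function name is illustrative, not part of the rpart package):

```python
def gini_impurity(labels):
    """Gini impurity of a partition: 1 - p1^2 - p2^2 for two classes.
    0 means a pure node; 0.5 is the maximum for a binary outcome."""
    n = len(labels)
    p1 = sum(labels) / n   # proportion of class 1 (labels coded 0/1)
    p2 = 1 - p1            # proportion of class 2
    return 1 - p1 ** 2 - p2 ** 2

# A pure partition has zero impurity; a 50/50 partition is maximally impure.
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(gini_impurity([0, 1, 0, 1]))  # 0.5
```

At each step, CART evaluates candidate splits and chooses the one whose descendant partitions minimize the weighted average of this impurity.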

