Limits...
A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM

View Article: PubMed Central - PubMed

ABSTRACT

Class imbalance ubiquitously exists in real life, which has attracted much interest from various domains. Direct learning from imbalanced dataset may pose unsatisfying results overfocusing on the accuracy of identification and deriving a suboptimal model. Various methodologies have been developed in tackling this problem including sampling, cost-sensitive, and other hybrid ones. However, the samples near the decision boundary which contain more discriminative information should be valued and the skew of the boundary would be corrected by constructing synthetic samples. Inspired by the truth and sense of geometry, we designed a new synthetic minority oversampling technique to incorporate the borderline information. What is more, ensemble model always tends to capture more complicated and robust decision boundary in practice. Taking these factors into considerations, a novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS), has been proposed in dealing with imbalanced data learning (IDL) problems. Experiments on open access datasets showed significant superior performance using our model and a persuasive and intuitive explanation behind the method was illustrated. As far as we know, this is the first model combining ensemble of SVMs with borderline information for solving such condition.

No MeSH data available.


The average rankings of different algorithms on six datasets from three aspects. Rankings are integer scores as 8, 7, 6, 5, 4, 3, 2, and 1 assigned to each algorithm according to the performance.
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC5304315&req=5

fig4: The average rankings of different algorithms on six datasets from three aspects. Rankings are integer scores as 8, 7, 6, 5, 4, 3, 2, and 1 assigned to each algorithm according to the performance.

Mentions: We, respectively, averaged the results of G-means, F1-score, and precision in 10 independent turns. Table 4 was the final results on various dataset and the top 3 ones in each line were labeled with bold. A direct conclusion drawn from the table was BEBS, random forest, and AdaBoostM1 located in dominating board most of time and behaving stably in three metrics. Some reason accounting for this was that the requirements of careful adaptation about parameters for all the other sampling algorithms seemed crucial. However, the original SVM received worse results on dataset of Fertility, Pima, Segmentation 1, Segmentation 3, and F1-score in Parkinson was rather low. Such phenomenon verified the explanation about skew of SVM in imbalanced case. It was evident that random oversampling, random undersampling, and SMOTE-ENN were sensitive to the datasets because all of them needed parameters of manual setting according to specific case rather than adapting automatically. SMOTE outperformed these three methods but was less efficient than our proposed BEBS. Obviously, BEBS which performed well in three metrics stably benefited from both the intuitive extrapolation-SMOTE method involving boundary information and randomness from bootstrapping technique. To offer a more direct cognition, we ranked the performance of methods on testing sets in decreasing order from the perspective of G-means, F1-score, and precision. The average ranks of all algorithms on 6 datasets were shown in Figure 4. Also, taking generalization and performance into consideration, random forest and AdaBoostM1 were still worthwhile to make a trial with no additional information.


A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM
The average rankings of different algorithms on six datasets from three aspects. Rankings are integer scores as 8, 7, 6, 5, 4, 3, 2, and 1 assigned to each algorithm according to the performance.
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC5304315&req=5

fig4: The average rankings of different algorithms on six datasets from three aspects. Rankings are integer scores as 8, 7, 6, 5, 4, 3, 2, and 1 assigned to each algorithm according to the performance.
Mentions: We, respectively, averaged the results of G-means, F1-score, and precision in 10 independent turns. Table 4 was the final results on various dataset and the top 3 ones in each line were labeled with bold. A direct conclusion drawn from the table was BEBS, random forest, and AdaBoostM1 located in dominating board most of time and behaving stably in three metrics. Some reason accounting for this was that the requirements of careful adaptation about parameters for all the other sampling algorithms seemed crucial. However, the original SVM received worse results on dataset of Fertility, Pima, Segmentation 1, Segmentation 3, and F1-score in Parkinson was rather low. Such phenomenon verified the explanation about skew of SVM in imbalanced case. It was evident that random oversampling, random undersampling, and SMOTE-ENN were sensitive to the datasets because all of them needed parameters of manual setting according to specific case rather than adapting automatically. SMOTE outperformed these three methods but was less efficient than our proposed BEBS. Obviously, BEBS which performed well in three metrics stably benefited from both the intuitive extrapolation-SMOTE method involving boundary information and randomness from bootstrapping technique. To offer a more direct cognition, we ranked the performance of methods on testing sets in decreasing order from the perspective of G-means, F1-score, and precision. The average ranks of all algorithms on 6 datasets were shown in Figure 4. Also, taking generalization and performance into consideration, random forest and AdaBoostM1 were still worthwhile to make a trial with no additional information.

View Article: PubMed Central - PubMed

ABSTRACT

Class imbalance ubiquitously exists in real life, which has attracted much interest from various domains. Direct learning from imbalanced dataset may pose unsatisfying results overfocusing on the accuracy of identification and deriving a suboptimal model. Various methodologies have been developed in tackling this problem including sampling, cost-sensitive, and other hybrid ones. However, the samples near the decision boundary which contain more discriminative information should be valued and the skew of the boundary would be corrected by constructing synthetic samples. Inspired by the truth and sense of geometry, we designed a new synthetic minority oversampling technique to incorporate the borderline information. What is more, ensemble model always tends to capture more complicated and robust decision boundary in practice. Taking these factors into considerations, a novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS), has been proposed in dealing with imbalanced data learning (IDL) problems. Experiments on open access datasets showed significant superior performance using our model and a persuasive and intuitive explanation behind the method was illustrated. As far as we know, this is the first model combining ensemble of SVMs with borderline information for solving such condition.

No MeSH data available.