Computer-aided lung nodule recognition by SVM classifier based on combination of random undersampling and SMOTE.

Sui Y, Wei Y, Zhao D - Comput Math Methods Med (2015)

Bottom Line: Unbalanced datasets often have detrimental effects on classification performance. Eight features, including 2D and 3D features, are extracted for training and classification. Experimental results show that, for different sizes of training datasets, the RU-SMOTE-SVM classifier achieves the highest classification accuracy among the four kinds of classifiers, and the average classification accuracy is more than 92.94%.


Affiliation: Software College, Northeastern University, Shenyang 110004, China.

ABSTRACT
In lung cancer computer-aided detection/diagnosis (CAD) systems, classification of regions of interest (ROIs) is often used to detect and diagnose lung nodules accurately. However, unbalanced datasets often have detrimental effects on classification performance. In this paper, both the minority and majority classes are resampled to improve generalization ability. We propose a novel SVM classifier combined with random undersampling (RU) and SMOTE for lung nodule recognition. Combining the two resampling methods not only balances the training samples but also removes noise and duplicate information from the training set while retaining useful information, which improves effective data utilization and hence the performance of the SVM algorithm for pulmonary nodule classification on unbalanced data. Eight features, including 2D and 3D features, are extracted for training and classification. Experimental results show that, for different sizes of training datasets, our RU-SMOTE-SVM classifier achieves the highest classification accuracy among the four kinds of classifiers, and the average classification accuracy is more than 92.94%.
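The random undersampling (RU) half of the resampling scheme can be illustrated with a short Python sketch. This is only a minimal illustration under assumptions, not the authors' implementation: the function name `random_undersample`, the toy data, and the target size (twice the minority count) are hypothetical choices made here, since the abstract does not state the sampling rates used.

```python
import random


def random_undersample(majority, n_keep, rng=None):
    """Keep n_keep majority-class samples, drawn uniformly without replacement."""
    rng = rng or random.Random()
    if n_keep >= len(majority):
        # Nothing to discard: return a copy of the full majority class.
        return list(majority)
    return rng.sample(list(majority), n_keep)


# Hypothetical toy data: 20 majority samples and 5 minority samples.
majority = [(float(i), float(i % 3)) for i in range(20)]
minority = [(100.0 + i, 1.0) for i in range(5)]

# Assumed target for illustration only: shrink the majority class to
# twice the minority size, leaving the remaining gap to oversampling.
reduced = random_undersample(majority, n_keep=2 * len(minority),
                             rng=random.Random(42))
print(len(majority), "->", len(reduced))
```

Undersampling only part of the way, and closing the remaining gap by oversampling the minority class, is one plausible way to combine the two methods; the actual split between them is a tuning decision.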


Figure 2: Sample xi, its K-nearest neighbors (K = 6), and the new synthetic sample by SMOTE.

Mentions: Figure 2 shows an example of the SMOTE process on a typical unbalanced data distribution, in which circles and pentagons denote samples of the minority class and the majority class, respectively. The number of nearest neighbors is K = 6. The figure also shows the new sample constructed along the line connecting xi and xi(t); the newly generated sample is marked with a red solid circle. The SMOTE algorithm is based on the assumption that a sample constructed between nearby minority-class samples still belongs to the minority class. Its basic idea is to obtain synthetic minority-class samples by oversampling along the lines connecting existing minority-class samples. For each sample in the minority class, SMOTE finds its K nearest neighbors among the samples of the same class, randomly selects one of them, and constructs a new artificial minority-class sample between the two samples by linear interpolation. After SMOTE processing, the number of minority-class samples increases K times. If more artificial minority-class samples are needed, the interpolation process is repeated until the newly generated training set is balanced, and the resulting dataset is then used to train the classifier.
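The interpolation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the function names `k_nearest` and `smote`, and the `n_new_per_sample` parameter are hypothetical choices made here, and the random interpolation factor in [0, 1) follows the usual SMOTE formulation.

```python
import math
import random


def k_nearest(idx, samples, k):
    """Indices of the k samples closest to samples[idx] (Euclidean distance)."""
    dists = sorted(
        (math.dist(samples[idx], s), j)
        for j, s in enumerate(samples)
        if j != idx
    )
    return [j for _, j in dists[:k]]


def smote(minority, k=6, n_new_per_sample=1, rng=None):
    """Create synthetic minority samples by linear interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = rng or random.Random()
    synthetic = []
    for i, x in enumerate(minority):
        neighbors = k_nearest(i, minority, k)
        for _ in range(n_new_per_sample):
            x_t = minority[rng.choice(neighbors)]  # one random neighbor xi(t)
            gap = rng.random()                     # interpolation factor in [0, 1)
            # New point on the segment between x and x_t.
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, x_t)))
    return synthetic


# Hypothetical 2D minority-class samples, for illustration only.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
            (0.5, 0.5), (2.0, 2.0), (2.0, 0.0), (0.0, 2.0)]
new_samples = smote(minority, k=6, n_new_per_sample=2, rng=random.Random(0))
print(len(new_samples), "synthetic samples generated")
```

Because every synthetic point lies on a segment between two existing minority samples, it stays inside the region those samples span, which is the assumption the paragraph above relies on.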

