Limits...
LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data.

Nakamura M, Kajiwara Y, Otsuka A, Kimura H - BioData Min (2013)

Bottom Line: In order to improve the effectiveness of SMOTE, this paper presents a novel over-sampling method using codebooks obtained by the learning vector quantization.In general, even when an existing SMOTE applied to a biomedical dataset, its empty feature space is still so huge that most classification algorithms would not perform well on estimating borderlines between classes.Experiments on datasets for β-turn types prediction show some important patterns that have not been seen in previous analyses.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Natural Science and Engineering, Kanazawa University, Ishikawa 9200941, Japan. m-nakamura@blitz.ec.t.kanazawa-u.ac.jp.

ABSTRACT

Background: Over-sampling methods based on Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems of imbalanced biomedical data. However, the existing over-sampling methods achieve slightly better or sometimes worse result than the simplest SMOTE. In order to improve the effectiveness of SMOTE, this paper presents a novel over-sampling method using codebooks obtained by the learning vector quantization. In general, even when an existing SMOTE applied to a biomedical dataset, its empty feature space is still so huge that most classification algorithms would not perform well on estimating borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples which occupy more feature space than the other SMOTE algorithms. Briefly saying, our over-sampling method enables to generate useful synthetic samples by referring to actual samples taken from real-world datasets.

Results: Experiments on eight real-world imbalanced datasets demonstrate that our proposed over-sampling method performs better than the simplest SMOTE on four of five standard classification algorithms. Moreover, it is seen that the performance of our method increases if the latest SMOTE called MWMOTE is used in our algorithm. Experiments on datasets for β-turn types prediction show some important patterns that have not been seen in previous analyses.

Conclusions: The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. Besides, the proposed over-sampling method is basically compatible with basic classification algorithms and the existing over-sampling methods.

No MeSH data available.


Example of codebooks obtained by Learning Vector Quantization. These codebooks are extracted from the samples in Iris dataset [12]. Each of the painted colored points represents the numerical value of a codebook.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC4016036&req=5

Figure 1: Example of codebooks obtained by Learning Vector Quantization. These codebooks are extracted from the samples in Iris dataset [12]. Each of the painted colored points represents the numerical value of a codebook.

Mentions: LVQ is a supervised classification algorithm that has been widely used for various research purposes such as image decompression, clustering, and data visualization. LVQ is one of the neural networks modeled after the human’s visual cortex. Briefly saying, the algorithm of LVQ is a supervised version of K–means algorithm. As like K–means, the algorithm of LVQ determines a number of centroids called codebooks for each feature. Figure 1 shows an example of codebooks calculated by LVQ. The data in the figure are taken from Iris dataset (a benchmark dataset in UCI repository [12]), where the number of features is reduced from four to two by the principal component analysis. In the figure, each of the painted colored points represents the numerical value of a codebook. These codebooks are used to determine the class of an unknown sample according to the k nearest neighbor rule. Each codebook is randomly placed in the beginning and moves according to a rule based on the K–means algorithm.


LVQ-SMOTE - Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data.

Nakamura M, Kajiwara Y, Otsuka A, Kimura H - BioData Min (2013)

Example of codebooks obtained by Learning Vector Quantization. These codebooks are extracted from the samples in Iris dataset [12]. Each of the painted colored points represents the numerical value of a codebook.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC4016036&req=5

Figure 1: Example of codebooks obtained by Learning Vector Quantization. These codebooks are extracted from the samples in Iris dataset [12]. Each of the painted colored points represents the numerical value of a codebook.
Mentions: LVQ is a supervised classification algorithm that has been widely used for various research purposes such as image decompression, clustering, and data visualization. LVQ is one of the neural networks modeled after the human’s visual cortex. Briefly saying, the algorithm of LVQ is a supervised version of K–means algorithm. As like K–means, the algorithm of LVQ determines a number of centroids called codebooks for each feature. Figure 1 shows an example of codebooks calculated by LVQ. The data in the figure are taken from Iris dataset (a benchmark dataset in UCI repository [12]), where the number of features is reduced from four to two by the principal component analysis. In the figure, each of the painted colored points represents the numerical value of a codebook. These codebooks are used to determine the class of an unknown sample according to the k nearest neighbor rule. Each codebook is randomly placed in the beginning and moves according to a rule based on the K–means algorithm.

Bottom Line: In order to improve the effectiveness of SMOTE, this paper presents a novel over-sampling method using codebooks obtained by the learning vector quantization.In general, even when an existing SMOTE applied to a biomedical dataset, its empty feature space is still so huge that most classification algorithms would not perform well on estimating borderlines between classes.Experiments on datasets for β-turn types prediction show some important patterns that have not been seen in previous analyses.

View Article: PubMed Central - HTML - PubMed

Affiliation: Department of Natural Science and Engineering, Kanazawa University, Ishikawa 9200941, Japan. m-nakamura@blitz.ec.t.kanazawa-u.ac.jp.

ABSTRACT

Background: Over-sampling methods based on Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems of imbalanced biomedical data. However, the existing over-sampling methods achieve slightly better or sometimes worse result than the simplest SMOTE. In order to improve the effectiveness of SMOTE, this paper presents a novel over-sampling method using codebooks obtained by the learning vector quantization. In general, even when an existing SMOTE applied to a biomedical dataset, its empty feature space is still so huge that most classification algorithms would not perform well on estimating borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples which occupy more feature space than the other SMOTE algorithms. Briefly saying, our over-sampling method enables to generate useful synthetic samples by referring to actual samples taken from real-world datasets.

Results: Experiments on eight real-world imbalanced datasets demonstrate that our proposed over-sampling method performs better than the simplest SMOTE on four of five standard classification algorithms. Moreover, it is seen that the performance of our method increases if the latest SMOTE called MWMOTE is used in our algorithm. Experiments on datasets for β-turn types prediction show some important patterns that have not been seen in previous analyses.

Conclusions: The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. Besides, the proposed over-sampling method is basically compatible with basic classification algorithms and the existing over-sampling methods.

No MeSH data available.