Staging of prostate cancer using automatic feature selection, sampling and Dempster-Shafer fusion.

Chandana S, Leung H, Trpkov K - Cancer Inform (2009)

Bottom Line: The performance of under-sampling, the synthetic minority over-sampling technique (SMOTE), and a combination of the two were also investigated, and the performance of the obtained models was compared. To combine the classifier outputs, we used the Dempster-Shafer (DS) theory, whereas the actual choice of combined models was made using a GA. We found that the best performance for the overall system resulted from the use of under-sampled data combined with rough-sets-based features modeled as a support vector machine (SVM).

View Article: PubMed Central - PubMed

Affiliation: Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, Canada.

ABSTRACT
A novel technique of automatically selecting the best pairs of features and sampling techniques to predict the stage of prostate cancer is proposed in this study. The problem of class imbalance, which is prominent in most medical data sets, is also addressed here. Three feature subsets, obtained by the use of principal components analysis (PCA), genetic algorithm (GA), and rough sets (RS) based approaches, were also used in the study. The performance of under-sampling, the synthetic minority over-sampling technique (SMOTE), and a combination of the two were also investigated, and the performance of the obtained models was compared. To combine the classifier outputs, we used the Dempster-Shafer (DS) theory, whereas the actual choice of combined models was made using a GA. We found that the best performance for the overall system resulted from the use of under-sampled data combined with rough-sets-based features modeled as a support vector machine (SVM).
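The best-performing combination reported in the abstract, under-sampled data fed to an SVM, can be illustrated with a minimal sketch. This uses synthetic imbalanced data and scikit-learn; the authors' actual prostate-cancer features came from a rough-sets selection step that is not reproduced here, so everything below is an illustrative assumption, not their code.

```python
# Sketch: random under-sampling of the majority class, then an SVM.
# Synthetic data stands in for the clinical data set used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, weights=[0.9, 0.1],
                           random_state=0)  # ~90/10 class imbalance

# Under-sample the majority class down to the minority-class size.
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=minority.size,
                      replace=False)
keep = np.concatenate([majority, minority])
X_bal, y_bal = X[keep], y[keep]  # now balanced 50/50

clf = SVC().fit(X_bal, y_bal)
```

SMOTE, the over-sampling alternative the study compares against, would instead synthesize new minority-class points (e.g. via the imbalanced-learn library) rather than discarding majority-class ones.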

No MeSH data available.


Related in: MedlinePlus

f1-cin-07-57: SVM performance for different training data sizes.

Mentions: Experiments were done to assess the performance of all of the feature extractor-sampling-classifier pairs. The available data set was divided into two parts: one for building the models and the other for testing the developed models. The appropriate ratio of training to testing data sizes for the SVM was identified by running different trials, as shown in Figure 1. The best testing performance was observed when 70% of the total samples were used for training. The dip in SVM performance beyond a training data size of 70% can be attributed to overfitting: the trained model lost its ability to generalize and became overly rigid to the training data. Because the performance of the KNN depends on the number of neighbors considered in the output class allocation, the optimum number of neighbors was identified by running trials with different values, and five neighbors performed best. Although a higher number of neighbors may seem to generalize better, it is the separability of the data according to the assigned classes that has the greatest influence on the appropriate number of neighbors.
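The two trial procedures described above, sweeping the training fraction for the SVM and sweeping k for the KNN, can be sketched as follows. This is a hedged reconstruction on synthetic data with scikit-learn, not the authors' experiment; the candidate fractions and k values are assumptions.

```python
# Sketch of the ratio and neighbor trials on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Trial 1: vary the training fraction and record SVM test accuracy.
svm_scores = {}
for frac in (0.5, 0.6, 0.7, 0.8, 0.9):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=frac, random_state=0)
    svm_scores[frac] = SVC().fit(X_tr, y_tr).score(X_te, y_te)
best_frac = max(svm_scores, key=svm_scores.get)

# Trial 2: vary k for the KNN on a fixed 70/30 split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=0)
knn_scores = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    for k in (1, 3, 5, 7, 9)
}
best_k = max(knn_scores, key=knn_scores.get)
```

On the paper's data the sweep peaked at a 70% training fraction and k = 5; on other data the optima will generally differ.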


Staging of prostate cancer using automatic feature selection, sampling and Dempster-Shafer fusion.

Chandana S, Leung H, Trpkov K - Cancer Inform (2009)

SVM performance for different training data sizes.
© Copyright Policy - open-access
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC2664701&req=5

f1-cin-07-57: SVM performance for different training data sizes.
Mentions: Experiments were done to assess the performance of all of the feature extractor-sampling-classifier pairs. The available data set was divided into two: one for building the models and the other to test the developed models. Appropriate ratio of the training and testing data sizes for SVM was identified by running different trials as shown in Figure 1. The best testing performance was observed when 70% of the total samples were used for training. The dip in the performance of the SVM beyond the training data size of 70% can be attributed to overfitting, when the trained model lost its ability to generalize, and was rather rigid to the training data. As the performance of the KNN depends on the number of neighbors considered in the output class allocation, the optimum number of neighbors was identified by running trials with different sizes, and 5 was the most optimal. Although higher number of neighbors may seem to have the ability to generalize, it is the separability of the data according to assigned classes that has the highest influence on the appropriate number of neighbors.

Bottom Line: The performance of under-sampling, synthetic minority over-sampling technique (SMOTE) and a combination of the two were also investigated and the performance of the obtained models was compared.To combine the classifier outputs, we used the Dempster-Shafer (DS) theory, whereas the actual choice of combined models was made using a GA.We found that the best performance for the overall system resulted from the use of under sampled data combined with rough sets based features modeled as a support vector machine (SVM).

View Article: PubMed Central - PubMed

Affiliation: Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, Canada.

ABSTRACT
A novel technique of automatically selecting the best pairs of features and sampling techniques to predict the stage of prostate cancer is proposed in this study. The problem of class imbalance, which is prominent in most medical data sets is also addressed here. Three feature subsets obtained by the use of principal components analysis (PCA), genetic algorithm (GA) and rough sets (RS) based approaches were also used in the study. The performance of under-sampling, synthetic minority over-sampling technique (SMOTE) and a combination of the two were also investigated and the performance of the obtained models was compared. To combine the classifier outputs, we used the Dempster-Shafer (DS) theory, whereas the actual choice of combined models was made using a GA. We found that the best performance for the overall system resulted from the use of under sampled data combined with rough sets based features modeled as a support vector machine (SVM).

No MeSH data available.


Related in: MedlinePlus