Genetic programming based ensemble system for microarray data classification.

Liu KH, Tong M, Xie ST, Yee Ng VT - Comput Math Methods Med (2015)

Bottom Line: Each individual of the GP is an ensemble system, and they become more and more accurate in the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. By using elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved.


Affiliation: Software School of Xiamen University, Xiamen, Fujian 361005, China; Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon 999077, Hong Kong.

ABSTRACT
Machine learning techniques are increasingly applied to microarray data analysis. This study proposes a new genetic programming (GP) based ensemble system, named GPES, for classifying different types of cancer. Decision trees serve as the base classifiers in this ensemble framework, and they are combined with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and the individuals become more accurate over the course of the evolutionary process. Feature selection and balanced subsampling are applied to increase the diversity within each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting the data automatically. The performance of GPES is evaluated on five binary-class and six multiclass microarray datasets, and the results show that the algorithm achieves better results in most cases than some other ensemble systems. Using more elaborate base classifiers or other sampling techniques may further improve the performance of GPES.



Figure 4: Change of accuracy in different phases.

Mentions: Figure 4 shows the average accuracy of GPES in the different phases (see Section 3.2) of a run, on both validation sets and test sets. Note that test-set samples are never mixed into the training set, so the test results are not biased by the training data. The changes in accuracy across phases on the validation sets are plotted in Figure 4(a). Phase 2 significantly improves on Phase 1, because its function is to select the trees that perform above average on the validation set. The transition from Phase 2 to Phase 3 is the GP evolutionary process. Phase 3 reports the results of the last GP generation, by which point the average performance of the individuals has improved, so the curves rise steadily in most cases. The Colon dataset is an exception: as Figure 4(a) shows, its curve drops in Phase 3. A possible reason is that a new training/validation split is used in Phase 3. The original dataset is so small that the training set and validation set may not follow the same distribution, and as a result the trained classifiers fit the validation set poorly. Phase 4 raises the curve again because only individuals with above-average performance are kept. Finally, in the forward search step (Phase 5), the accuracy curves rise markedly again for all datasets.
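The two selection ideas in this passage, keeping only above-average candidates and then building the final committee by forward search, can be sketched as follows. This is a rough illustration only. The excerpt does not give the paper's selection criterion or combination rule, so this sketch assumes validation accuracy as the criterion and score averaging as the committee rule. The function names and the data layout (`candidates[i][j]` is the per-class score vector of candidate `i` on validation sample `j`) are hypothetical.

```python
from statistics import mean


def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)


def committee_predictions(members, candidates):
    """Average the members' class scores per sample, then take the argmax."""
    n_samples = len(candidates[members[0]])
    preds = []
    for j in range(n_samples):
        vectors = [candidates[m][j] for m in members]
        averaged = [sum(col) / len(col) for col in zip(*vectors)]
        preds.append(argmax(averaged))
    return preds


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def keep_above_average(candidates, labels):
    """Return indices of candidates whose accuracy exceeds the mean accuracy."""
    scores = [accuracy(committee_predictions([i], candidates), labels)
              for i in range(len(candidates))]
    threshold = mean(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


def forward_search(pool, candidates, labels):
    """Greedily add the candidate that most improves validation accuracy.

    Stops as soon as no remaining candidate strictly improves the score
    (the stopping rule is an assumption of this sketch).
    """
    committee, best = [], 0.0
    remaining = list(pool)
    while remaining:
        scored = [(accuracy(committee_predictions(committee + [c], candidates),
                            labels), c) for c in remaining]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break
        committee.append(top)
        remaining.remove(top)
        best = top_score
    return committee, best
```

A typical use would call `keep_above_average` first and pass the surviving indices to `forward_search` as the pool, so the committee is grown only from candidates that already beat the average.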

