Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data.

Wang J, Bø TH, Jonassen I, Myklebost O, Hovig E - BMC Bioinformatics (2003)

Bottom Line: The models gave class prediction with markedly reduced error rates compared to other class prediction approaches, and the importance of feature selection in microarray data analysis was also emphasized. Our models identify marker genes with predictive potential, often better than other available methods in the literature. The limitations identified are not specific to the classification models used.


Affiliation: Department of Tumor Biology, The Norwegian Radium Hospital, N0310 Oslo, Norway. junbaiw@radium.uio.no

ABSTRACT

Background: Using DNA microarrays, we have developed two novel models for tumor classification and target gene prediction. First, gene expression profiles are summarized by optimally selected Self-Organizing Maps (SOMs), followed by tumor sample classification by Fuzzy C-means clustering. Then, the prediction of marker genes is accomplished by either manual feature selection (visualizing the weighted/mean SOM component plane) or automatic feature selection (by pair-wise Fisher's linear discriminant).

Results: The proposed models were tested on four published datasets: (1) leukemia, (2) colon cancer, (3) brain tumors, and (4) NCI cancer cell lines. The models gave class prediction with markedly reduced error rates compared to other class prediction approaches, and the importance of feature selection in microarray data analysis was also emphasized.

Conclusions: Our models identify marker genes with predictive potential, often better than other available methods in the literature. The models are potentially useful for medical diagnostics and may reveal some insights into cancer classification. Additionally, we illustrated two limitations in tumor classification from microarray data that stem from the biology underlying the data: (1) the class size of the data, and (2) the internal structure of the classes. These limitations are not specific to the classification models used.
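The abstract names fuzzy c-means clustering as the sample-classification step but does not spell out the update rules. The following is a minimal sketch of the textbook fuzzy c-means iteration, not the authors' implementation: the function name, the fuzzifier m = 2, the explicit starting centers, and the toy two-dimensional samples are all assumptions made for illustration, and no SOM summarization is performed beforehand.

```python
import math


def fuzzy_c_means(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    """Textbook fuzzy c-means; returns (centers, memberships).

    memberships[k][i] is the degree to which point k belongs to cluster i,
    computed against the centers of the last completed iteration.
    """
    centers = [list(c) for c in initial_centers]
    power = 2.0 / (m - 1.0)
    dim = len(points[0])
    memberships = []
    for _ in range(max_iter):
        # Membership update: u_ki = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            zeros = [d == 0.0 for d in dists]
            if any(zeros):
                # A point lying exactly on a center belongs fully to it.
                row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                row = [1.0 / sum((di / dj) ** power for dj in dists)
                       for di in dists]
            memberships.append(row)
        # Center update: mean of the points weighted by membership^m.
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Toy samples invented for illustration (two well-separated groups).
samples = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, memberships = fuzzy_c_means(samples, [[0, 0], [10, 10]])
# Hard class call: assign each sample to its highest-membership cluster.
labels = [row.index(max(row)) for row in memberships]
print(labels)
```

Unlike hard k-means, every sample keeps a graded membership in every cluster, so borderline samples can be flagged rather than forced into one class; the hard labels above are only a convenience for comparison with known classes.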



Figure 3: Empirical cumulative distribution of the significant scores dE. a) Leukemia data set. b) Colon data set. c) Brain tumor data set. d) NCI60 cancer cell line data set. In each plot, the percentage of F(dE) that maximizes the classification performance is marked by a smooth red line.

Mentions: In the following sections, we demonstrate the performance of the two suggested models using four microarray data sets: (1) leukaemia; (2) colon cancer; (3) brain tumors; and (4) cancer cell lines from the NCI60 data set. All data sets are publicly available. In this work, the search for the optimal number of SOM reference vectors ranged from 2 to 1120 and is shown in figure (1). The feature map units selected by model one (manual feature selection) are marked by light green squares in figure (2), and the empirical cumulative distribution of the significant score dE of feature genes (clustered in feature map units) is shown in figure (3).
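The empirical cumulative distribution F(dE) mentioned for figure (3) can be illustrated with a short sketch. It shows only the generic construction, F(x) as the fraction of scores less than or equal to x, followed by a cutoff on F. The score values and the 0.75 cutoff below are hypothetical, and how dE itself is computed is not described in this excerpt, so it is not reproduced here.

```python
from bisect import bisect_right


def empirical_cdf(scores):
    """Return F, where F(x) is the fraction of scores that are <= x."""
    ordered = sorted(scores)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted scores are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical dE scores for eight feature map units (invented values).
scores = [0.2, 1.5, 0.7, 3.1, 0.9, 2.4, 0.4, 1.1]
F = empirical_cdf(scores)

# Hypothetical cutoff: keep units whose score lies above 75% of all scores.
cutoff = 0.75
selected = [i for i, s in enumerate(scores) if F(s) > cutoff]
print(selected)
```

In a setting like the one in figure (3), the cutoff would be tuned against classification performance rather than fixed in advance.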

