Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data.

Wang J, Bø TH, Jonassen I, Myklebost O, Hovig E - BMC Bioinformatics (2003)

Bottom Line: The models gave class prediction with markedly reduced error rates compared to other class prediction approaches, and the importance of feature selection on microarray data analysis was also emphasized. Our models identify marker genes with predictive potential, often better than other available methods in the literature. These limitations are not specific for the classification models used.


Affiliation: Department of Tumor Biology, The Norwegian Radium Hospital, N0310 Oslo, Norway. junbaiw@radium.uio.no

ABSTRACT

Background: Using DNA microarrays, we have developed two novel models for tumor classification and target gene prediction. First, gene expression profiles are summarized by optimally selected Self-Organizing Maps (SOMs), followed by tumor sample classification by Fuzzy C-means clustering. Then, the prediction of marker genes is accomplished by either manual feature selection (visualizing the weighted/mean SOM component plane) or automatic feature selection (by pair-wise Fisher's linear discriminant).

Results: The proposed models were tested on four published datasets: (1) Leukemia (2) Colon cancer (3) Brain tumors and (4) NCI cancer cell lines. The models gave class prediction with markedly reduced error rates compared to other class prediction approaches, and the importance of feature selection on microarray data analysis was also emphasized.

Conclusions: Our models identify marker genes with predictive potential, often better than other available methods in the literature. The models are potentially useful for medical diagnostics and may reveal some insights into cancer classification. Additionally, we illustrated two limitations in tumor classification from microarray data related to the biology underlying the data, in terms of (1) the class size of data, and (2) the internal structure of classes. These limitations are not specific for the classification models used.
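
The fuzzy c-means step named in the Background can be sketched as follows. This is a minimal, generic illustration of the standard algorithm, not the authors' implementation: the function name `fuzzy_c_means`, the fuzzifier value of 2.0, the random initialisation, and the use of Euclidean distance are all assumptions made for this example.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, fuzzifier=2.0, n_iter=100, tol=1e-6, seed=0):
    """Generic fuzzy c-means on the rows of X.

    Returns (centers, U), where U[i, j] is the membership of sample i
    in cluster j and every row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Assumed initialisation: random memberships, normalised per sample.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Cluster centers are membership-weighted means of the samples.
        W = U ** fuzzifier
        centers = (W.T @ X) / W.sum(axis=0)[:, None]

        # Euclidean distance from every sample to every center,
        # floored to avoid division by zero.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)

        # Standard membership update: u_ij proportional to d_ij^(-2/(m-1)).
        inv = dist ** (-2.0 / (fuzzifier - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)

        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break

    return centers, U
```

A hard class label per sample can be read off with `U.argmax(axis=1)`, while the membership values themselves indicate how ambiguous each assignment is.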


Figure 5: Diagrams of the two proposed classifier models. a) Model one, with manual feature selection. b) Model two, with automatic feature selection.

Mentions: In microarray data analysis, classifier design is a more ambitious, difficult, and potentially useful computational problem than clustering: it refers to the identification of a few typical genes from all available gene expression profiles. Once these genes are defined, a classifier can label every tumor sample in the entire sample collection. This is sometimes termed supervised learning (in this context, we are learning the genes' biological contribution to each type of tumor). By combining the three techniques above (optimally selected SOM, FCC and PFLD), we have created two types of classifier models. Model one uses manual feature selection and model two uses automatic feature selection to predict the marker genes of each tumor class. A detailed illustration of these models is shown in Figure 5. Some features of the proposed models are explained here. First, the preprocessing of microarray data is essential, in that different choices may affect the outcome of the comparison. We therefore followed exactly the preprocessing protocol in [5], i.e. thresholding, filtering, a logarithmic transformation, and a standardization of each dataset, which enables a fair comparison with other methods. After preprocessing, each dataset was subjected to model one and model two (see Figure 5 for further details); no preprocessing steps were involved in the cross-validation. Second, for both models, the marker genes obtained from each run are subsequently used to predict the class labels of the test dataset (a randomly selected 1/3 of all learning samples) and to calculate the test-set error T_error.
Finally, to allow a comparison between the two proposed models, the number of feature map units (m) used by the automatic feature selection (model two) is defined as m = (number of tumor classes) × β, where β is a parameter chosen so that m is as close as possible to the number of feature map units identified by manual feature selection (model one).
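
The preprocessing sequence, the 1/3 test split, and the choice of β described above might be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the thresholds (`floor`, `ceiling`, `min_ratio`, `min_diff`) are left as required arguments because the text defers their values to [5], the standardization is assumed to be per sample across genes, and β is assumed to be a positive integer.

```python
import numpy as np


def preprocess(expr, floor, ceiling, min_ratio, min_diff):
    """expr: genes x samples array of raw intensities (floor must be > 0).

    Returns (standardized, keep), where keep is a boolean mask over genes.
    """
    # Thresholding: clip intensities into [floor, ceiling].
    clipped = np.clip(expr, floor, ceiling)

    # Filtering: keep genes that vary enough across samples.
    gmax, gmin = clipped.max(axis=1), clipped.min(axis=1)
    keep = (gmax / gmin > min_ratio) & (gmax - gmin > min_diff)

    # Logarithmic transformation (base 10 is an assumption).
    logged = np.log10(clipped[keep])

    # Standardization: zero mean and unit variance per sample (assumption).
    standardized = (logged - logged.mean(axis=0)) / logged.std(axis=0)
    return standardized, keep


def holdout_split(n_samples, test_fraction=1 / 3, seed=0):
    """Randomly set aside a fraction of samples as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]  # (train indices, test indices)


def choose_beta(n_classes, manual_units):
    """Pick integer beta so that m = n_classes * beta is closest to manual_units."""
    beta = max(1, round(manual_units / n_classes))
    return beta, n_classes * beta
```

For instance, with 3 tumor classes and 10 map units found manually, `choose_beta(3, 10)` gives β = 3 and m = 9, the multiple of 3 nearest to 10.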

