Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes.

Maulik U, Mukhopadhyay A, Bandyopadhyay S - BMC Bioinformatics (2009)

Bottom Line: Motivated by this fact, a novel fuzzy majority voting approach is proposed to combine the clustering information from all the solutions in the resultant Pareto-optimal set. Moreover, statistical significance tests have been carried out to establish the statistical superiority of the proposed clustering approach. The clusters of genes produced by the proposed technique are also found to be biologically significant, i.e., they consist of genes which belong to the same functional groups.


Affiliation: Department of Computer Science and Engineering, Jadavpur University, Kolkata, India. drumaulik@cse.jdvu.ac.in

ABSTRACT

Background: The landscape of biological and biomedical research is changing rapidly with the invention of microarrays, which enable a simultaneous view of the transcription levels of a huge number of genes across different experimental conditions or time points. Clustering algorithms have been widely applied to microarray data sets to identify groups of co-expressed genes. This article poses fuzzy clustering of microarray data as a multiobjective optimization problem that simultaneously optimizes two internal fuzzy cluster validity indices to yield a set of Pareto-optimal clustering solutions. Each of these solutions carries some information about the clustering structure of the input data. Motivated by this fact, a novel fuzzy majority voting approach is proposed to combine the clustering information from all the solutions in the resultant Pareto-optimal set. This approach first identifies the genes that most of the Pareto-optimal solutions assign to a particular cluster with a high membership degree. Using this set of genes as the training set, the remaining genes are classified by a supervised learning algorithm. In this work, a Support Vector Machine (SVM) classifier is used for this purpose.
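The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_training_genes` and the thresholds `alpha` (minimum membership degree) and `beta` (minimum fraction of agreeing solutions) are hypothetical names chosen here, and the sketch assumes that cluster labels have already been aligned across the Pareto-optimal solutions. Genes that do not pass the vote would then be handed to a classifier such as an SVM, which is not shown.

```python
from typing import Dict, List

# membership[g][k] is the membership degree of gene g in cluster k.
Membership = List[List[float]]


def select_training_genes(solutions: List[Membership],
                          alpha: float = 0.5,
                          beta: float = 0.5) -> Dict[int, int]:
    """Return {gene index: cluster label} for genes that pass the vote.

    Assumes every solution uses the same (already aligned) cluster labels.
    alpha and beta are illustrative thresholds, not values from the paper.
    """
    n_genes = len(solutions[0])
    n_clusters = len(solutions[0][0])
    training: Dict[int, int] = {}
    for g in range(n_genes):
        votes = [0] * n_clusters
        for membership in solutions:
            row = membership[g]
            # Cluster with the highest membership degree in this solution.
            k = max(range(n_clusters), key=row.__getitem__)
            # The solution votes only if that membership is high enough.
            if row[k] >= alpha:
                votes[k] += 1
        best = max(range(n_clusters), key=votes.__getitem__)
        # Keep the gene if enough solutions agree on the same cluster.
        if votes[best] > 0 and votes[best] >= beta * len(solutions):
            training[g] = best
    return training
```

With three toy solutions over three genes and two clusters, a call such as `select_training_genes(solutions, alpha=0.7, beta=0.6)` keeps only the genes on which the solutions agree confidently; the rest form the set to be classified afterwards.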

Results: The performance of the proposed clustering technique has been demonstrated on five publicly available benchmark microarray data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis Thaliana, Human Fibroblasts Serum and Rat Central Nervous System. Comparative studies of different SVM kernels and of several widely used microarray clustering techniques are reported. Moreover, statistical significance tests have been carried out to establish the statistical superiority of the proposed clustering approach. Finally, biological significance tests have been carried out using a web-based gene annotation tool to show that the proposed method is able to produce biologically relevant clusters of co-expressed genes.

Conclusion: The proposed clustering method has been shown to perform better than other well-known clustering algorithms in finding clusters of co-expressed genes efficiently. The clusters of genes produced by the proposed technique are also found to be biologically significant, i.e., they consist of genes which belong to the same functional groups. This indicates that the proposed clustering method can be used efficiently to identify co-expressed genes in microarray gene expression data.

Supplementary Website: The pre-processed and normalized data sets, the MATLAB code and other related materials are available at http://anirbanmukhopadhyay.50webs.com/mogasvm.html.



Figure 7: Boxplots of the p-values of the most significant GO terms of all the clusters having at least one significant GO term as obtained by different algorithms for Yeast Sporulation data. The p-values are log-transformed for better readability.

Mentions: The biological significance test for Yeast Sporulation data has been conducted at the 1% significance level. For the different algorithms, the number of clusters whose most significant GO term has a p-value less than 0.01 (1% significance level) is as follows: MOGA-SVM: 6; MOGA (without SVM): 6; MOGAcrisp-SVM (RBF): 6; FCM: 4; SGA: 6; Average linkage: 4; SOM: 4; and CRC: 6. Fig. 7 shows boxplots of the p-values of the most significant GO terms of all the clusters having at least one significant GO term, as obtained by the different algorithms. The p-values are log-transformed for better readability. It is evident from the figure that the boxplot corresponding to the MOGA-SVM method shows lower p-values (i.e., higher -log10(p-value)) than the boxplots of the other algorithms. This indicates that the clusters identified by MOGA-SVM are more biologically significant and functionally enriched than those of the other algorithms.
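The two quantities discussed here, the count of clusters that are significant at the 1% level and the -log10 transform used for the boxplots, can be sketched as follows. This is only an illustration: the helper names `count_significant` and `neg_log10` are hypothetical, and the p-values in the usage lines are made-up placeholders, not results from the paper.

```python
import math
from typing import List


def count_significant(best_p_values: List[float], level: float = 0.01) -> int:
    """Count clusters whose most significant GO term has p < level."""
    return sum(1 for p in best_p_values if p < level)


def neg_log10(p_values: List[float]) -> List[float]:
    """Transform p-values to -log10(p); smaller p gives a larger value."""
    return [-math.log10(p) for p in p_values]


# Placeholder p-values (one per cluster), purely for illustration.
example_p_values = [1e-5, 0.02, 0.003]
n_significant = count_significant(example_p_values)
transformed = neg_log10(example_p_values)
```

Because the transform is monotonically decreasing in p, a boxplot that sits higher on the -log10 scale corresponds to smaller p-values, which is how the figure is read in the text.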

