A comparison of four clustering methods for brain expression microarray data.

Richards AL, Holmans P, O'Donovan MC, Owen MJ, Jones L - BMC Bioinformatics (2008)

Bottom Line: The results are compared on speed, gene coverage and GO enrichment. Combining clusters produced by k-means and memISA or ISA leads to increased GO enrichment and number of clusters produced (compared to k-means alone), without negatively impacting gene coverage. memISA can also find potentially disease-related clusters. In two independent dorsolateral prefrontal cortex datasets, it finds three overlapping clusters that are either enriched for genes associated with schizophrenia, genes differentially expressed in schizophrenia, or both.


Affiliation: Department of Psychological Medicine, School of Medicine, University Hospital Wales, Heath Park, Cardiff, Wales, UK. richardsal1@cardiff.ac.uk

ABSTRACT

Background: DNA microarrays, which determine the expression levels of tens of thousands of genes from a sample, are an important research tool. However, the volume of data they produce can be an obstacle to interpretation of the results. Clustering the genes on the basis of similarity of their expression profiles can simplify the data, and potentially provides an important source of biological inference, but these methods have not been tested systematically on datasets from complex human tissues. In this paper, four clustering methods, CRC, k-means, ISA and memISA, are applied to three brain expression datasets. The results are compared on speed, gene coverage and GO enrichment. The effects of combining the clusters produced by each method are also assessed.

Results: k-means outperforms the other methods, with 100% gene coverage and GO enrichments only slightly exceeded by memISA and ISA. Those two methods produce greater GO enrichments on the datasets used, but at the cost of much lower gene coverage, fewer clusters and lower speed. The clusters they find are largely different to those produced by k-means. Combining clusters produced by k-means and memISA or ISA leads to increased GO enrichment and number of clusters produced (compared to k-means alone), without negatively impacting gene coverage. memISA can also find potentially disease-related clusters. In two independent dorsolateral prefrontal cortex datasets, it finds three overlapping clusters that are either enriched for genes associated with schizophrenia, genes differentially expressed in schizophrenia, or both. Two of these clusters are enriched for genes of the MAP kinase pathway, suggesting a possible role for this pathway in the aetiology of schizophrenia.

Conclusion: Considered alone, k-means clustering is the most effective of the four methods on typical microarray brain expression datasets. However, memISA and ISA can add extra high-quality clusters to the set produced by k-means, so combining these three methods is the method of choice.
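The abstract does not say how cluster sets were combined, so the following is only a minimal sketch under a simple assumption: the combined set is the pooled list of clusters from each method, with exact duplicates dropped. The function names and the toy gene labels are hypothetical. It illustrates why adding ISA-style clusters to a k-means partition can raise the cluster count without lowering gene coverage.

```python
def combine_cluster_sets(*cluster_sets):
    """Pool clusters from several methods, dropping empty and exactly duplicated clusters.

    Assumption: pooling is a plain union of cluster lists; the paper's own
    combination procedure may differ.
    """
    seen = set()
    combined = []
    for cluster_set in cluster_sets:
        for cluster in cluster_set:
            key = frozenset(cluster)
            if key and key not in seen:
                seen.add(key)
                combined.append(set(cluster))
    return combined


def gene_coverage(clusters, chip_genes):
    """Percentage of chip genes assigned to at least one cluster."""
    covered = set().union(*clusters) & chip_genes
    return 100.0 * len(covered) / len(chip_genes)


# Toy data (hypothetical gene labels, not from the paper).
chip_genes = {f"g{i}" for i in range(10)}
kmeans_clusters = [{f"g{i}" for i in range(0, 5)}, {f"g{i}" for i in range(5, 10)}]
isa_clusters = [{"g0", "g1", "g2"}, {f"g{i}" for i in range(5, 10)}]

combined = combine_cluster_sets(kmeans_clusters, isa_clusters)
print(len(combined), gene_coverage(combined, chip_genes))
```

Because k-means assigns every gene to a cluster, pooling cannot reduce coverage below that of k-means alone; it can only add clusters.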


Figure 6: GO enrichment and gene coverage of clusters for all methods – MC66 dataset. Orange, green, yellow and light blue bars are the percentage of clusters that are significantly enriched for one or more GO categories at p < 0.05, 0.01, 0.001 and 0.0001 respectively. Dark blue bar is gene coverage, the percentage of genes available on the chip that are assigned to at least one cluster. Numbers in square brackets are the number of clusters produced by that method. 'Dist.' = distribution of sizes. Parameter set A for CRC is 10 chains and 20 iterations per chain. Parameter set B for CRC is 20 chains and 40 iterations per chain.
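The two quantities plotted in the figure can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: enrichment is scored here with a one-sided hypergeometric test per GO category, a cluster counts as enriched when its smallest p-value falls below the threshold, and no multiple-testing correction is applied. The function names, the category label and the gene labels are all hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M, of which n belong to the category."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def gene_coverage(clusters, chip_genes):
    """Percentage of chip genes assigned to at least one cluster."""
    covered = set().union(*clusters) & chip_genes
    return 100.0 * len(covered) / len(chip_genes)


def min_enrichment_p(cluster, go_categories, chip_genes):
    """Smallest over-representation p-value of a cluster across all categories."""
    cluster = cluster & chip_genes
    return min(
        hypergeom_sf(
            len(cluster & members),          # category genes found in the cluster
            len(chip_genes),                 # background: all genes on the chip
            len(members & chip_genes),       # category genes on the chip
            len(cluster),                    # cluster size
        )
        for members in go_categories.values()
    )


def percent_enriched(clusters, go_categories, chip_genes,
                     thresholds=(0.05, 0.01, 0.001, 0.0001)):
    """Percentage of clusters enriched for at least one category at each threshold."""
    pvals = [min_enrichment_p(c, go_categories, chip_genes) for c in clusters]
    return {t: 100.0 * sum(p < t for p in pvals) / len(pvals) for t in thresholds}


# Toy data (hypothetical): 20 chip genes, one category of 5 genes, two clusters.
chip_genes = {f"g{i}" for i in range(20)}
go_categories = {"toy_category_A": {"g0", "g1", "g2", "g3", "g4"}}
clusters = [{"g0", "g1", "g2", "g3"}, {"g10", "g11", "g12", "g13"}]

print(gene_coverage(clusters, chip_genes))
print(percent_enriched(clusters, go_categories, chip_genes))
```

In the toy data the first cluster lies entirely inside the category, giving p = 5/4845 (about 0.001), so it counts as enriched at 0.05 and 0.01 but not at 0.001; the second cluster shares no genes with the category.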

Mentions: All four methods performed better than the random cluster sets when examined using GO enrichment to represent known biological relationships (Figs. 5, 6, 7). This implies that all the clustering methods result in groupings of biological significance. Of the three random cluster sets, those with the same size distribution as ISA had slightly lower GO enrichment than those with the same size distribution as memISA or CRC. This may suggest that GO enrichment has a small bias against ISA due to the sizes of clusters it produces. However, at p < 0.05 the difference dropped to under 1% GO enrichment, suggesting that any such bias is extremely slight and may well be due to chance.
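A random cluster set with the same size distribution as a real one can be sketched as below. This excerpt does not describe the exact randomisation procedure, so the approach here is an assumption: each random cluster is drawn independently, without replacement, from the genes on the chip, so random clusters may overlap one another. The function name and gene labels are hypothetical.

```python
import random


def random_cluster_set(clusters, chip_genes, seed=None):
    """Return random clusters whose sizes match the given clusters one for one.

    Assumption: clusters are sampled independently of each other, so a gene
    may appear in more than one random cluster.
    """
    rng = random.Random(seed)
    pool = sorted(chip_genes)  # sorted so a fixed seed gives a reproducible result
    return [set(rng.sample(pool, len(c))) for c in clusters]


# Toy data (hypothetical gene labels).
chip_genes = {f"g{i}" for i in range(50)}
real_clusters = [{"g0", "g1", "g2"}, {"g3", "g4", "g5", "g6", "g7"}]

random_clusters = random_cluster_set(real_clusters, chip_genes, seed=0)
print([len(c) for c in random_clusters])
```

Scoring such size-matched random sets with the same enrichment measure as the real clusters gives a baseline that separates the effect of cluster size from the effect of the clustering method.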

