Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes.

Datta S, Datta S - BMC Bioinformatics (2006)

Bottom Line: While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. The biological homogeneity index (BHI), as the name suggests, is a measure of how biologically homogeneous the clusters are. For the yeast sporulation data set, two separate choices of the functional classes were used and the results were compared for consistency.


Affiliation: Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA. susmita.datta@louisville.edu

ABSTRACT

Background: Cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since the clusters also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species.
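The dissimilarity mentioned above (one minus Pearson's correlation coefficient) can be sketched in a few lines. This is only an illustration: the function name `pearson_dissimilarity` and the toy profiles are assumptions, not taken from the paper, and the clustering step itself (UPGMA, i.e. average linkage) is not shown.

```python
import math


def pearson_dissimilarity(x, y):
    """Return 1 - r, where r is Pearson's correlation between two profiles.

    Hypothetical helper for illustration; the result lies in [0, 2].
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_x * var_y)


# Toy expression profiles (made-up numbers).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # perfectly anti-correlated with gene_a
```

Under this measure, two genes whose profiles rise and fall together are close (dissimilarity near 0) regardless of their absolute expression levels, while anti-correlated genes are maximally far apart (dissimilarity near 2).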

Results: In this paper, we introduce two performance measures for evaluating a clustering algorithm's ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency.
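To make the idea of a homogeneity index concrete, here is a minimal sketch of one way such a score could be computed. The abstract does not give the formula, so the definition used here is an assumption: for each cluster, take the fraction of pairs of annotated genes that share at least one functional class, then average over clusters. The names `bhi`, `clusters`, and `annotations` and the toy data are hypothetical.

```python
from itertools import combinations


def bhi(clusters, annotations):
    """Average, over clusters, of the fraction of annotated gene pairs
    that share at least one functional class.

    clusters    : list of lists of gene identifiers
    annotations : dict mapping gene -> set of functional classes
    Assumption: a cluster with fewer than two annotated genes scores 0.
    """
    scores = []
    for cluster in clusters:
        annotated = [g for g in cluster if annotations.get(g)]
        n_pairs = len(annotated) * (len(annotated) - 1) // 2
        if n_pairs == 0:
            scores.append(0.0)
            continue
        shared = sum(
            1
            for a, b in combinations(annotated, 2)
            if annotations[a] & annotations[b]
        )
        scores.append(shared / n_pairs)
    return sum(scores) / len(scores)


# Toy example (made-up genes and classes).
toy_clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
toy_annotations = {
    "g1": {"A"},
    "g2": {"A", "B"},
    "g3": {"B"},
    "g4": {"C"},
    "g5": {"C"},
}
# First cluster: 2 of 3 pairs share a class; second cluster: 1 of 1.
# Under the assumed definition the score is (2/3 + 1) / 2 = 5/6.
toy_score = bhi(toy_clusters, toy_annotations)
```

A score near 1 means genes grouped together tend to share a functional class; a score near 0 means the clusters cut across the reference classes. Genes may belong to several classes, which is why the annotations are stored as sets.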

Conclusion: Functional information on annotated genes, available from various GO databases and mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information could be used to select the right algorithm from a class of clustering algorithms for the given data set.

Figure 1: BHI for various clustering algorithms applied to the normal and DCIS samples in breast cancer data. The thick black line is the 95th percentile of BHI values under random clustering.

Mentions: We first consider the breast cancer data. This data set consisted of 258 significant genes with eleven-dimensional expression profiles over four normal and seven DCIS samples. Based on the size of the data set, we judged that a number of clusters between four and ten might be appropriate. Thus, both the biological homogeneity index (BHI) and the biological stability index (BSI) were computed for each clustering algorithm over this range of cluster numbers. As described in the Methods section, we used eleven functional classes for this study. Figure 1 shows the plots of BHI for the ten clustering strategies along with the results for random clustering. The thick black piecewise linear curve denotes the 95th percentiles of the BHI values obtained by random clustering; these were computed by a Monte Carlo scheme, as described in the Methods section, based on 500 iterates. Thus, the probability of obtaining a BHI value higher than this curve purely by chance is estimated to be less than 5%. Therefore, any score by a clustering algorithm that lies above the thick black curve will be judged to be "statistically significant".
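The Monte Carlo threshold described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' scheme: the random-clustering model (every gene assigned to one of k non-empty clusters uniformly at random), the pairwise-agreement form of `bhi`, the nearest-rank percentile, and all function names are assumptions made for the example. Only the 500 iterates and the 95th percentile come from the text.

```python
import math
import random
from itertools import combinations


def bhi(clusters, annotations):
    """Assumed homogeneity score: mean over clusters of the fraction of
    annotated gene pairs sharing at least one functional class."""
    scores = []
    for cluster in clusters:
        annotated = [g for g in cluster if annotations.get(g)]
        n_pairs = len(annotated) * (len(annotated) - 1) // 2
        if n_pairs == 0:
            scores.append(0.0)  # assumption: too few annotated genes scores 0
            continue
        shared = sum(1 for a, b in combinations(annotated, 2)
                     if annotations[a] & annotations[b])
        scores.append(shared / n_pairs)
    return sum(scores) / len(scores)


def random_clustering(genes, k, rng):
    """Assumed null model: k non-empty clusters with random membership."""
    if len(genes) < k:
        raise ValueError("need at least k genes")
    shuffled = list(genes)
    rng.shuffle(shuffled)
    clusters = [[g] for g in shuffled[:k]]      # seed each cluster with one gene
    for g in shuffled[k:]:
        clusters[rng.randrange(k)].append(g)    # assign the rest at random
    return clusters


def bhi_null_percentile(genes, annotations, k, n_iter=500, q=0.95, seed=0):
    """Empirical q-th percentile (nearest-rank) of BHI under random clustering."""
    rng = random.Random(seed)
    null_scores = sorted(
        bhi(random_clustering(genes, k, rng), annotations)
        for _ in range(n_iter)
    )
    rank = max(1, math.ceil(q * n_iter))
    return null_scores[rank - 1]
```

Calling `bhi_null_percentile(genes, annotations, k)` for each k from four to ten would give one threshold per number of clusters, which corresponds to one point on a piecewise linear curve like the thick black line in Figure 1. An algorithm's BHI at a given k would then be compared against the threshold for the same k.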

