Limits...
Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization.

Cheng KO, Law NF, Siu WC, Liew AW - BMC Bioinformatics (2008)

Bottom Line: Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes.Unfortunately, many useful formulations result in NP-complete problems.Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations.

View Article: PubMed Central - HTML - PubMed

Affiliation: School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Queensland, Australia. k.o.cheng@polyu.edu.hk

ABSTRACT

Background: The DNA microarray technology allows the measurement of expression levels of thousands of genes under tens/hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only 1. Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of biclusters called additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.

Results: We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as pCluster algorithm. By relaxing the constraint in homogeneity, the proposed algorithm has polynomial-time complexity in the worst case instead of exponential-time complexity as in the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. Comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.

Conclusion: We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables an optimum parameter determination in the biclustering algorithm so as to achieve the best result. In future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.

Show MeSH

Related in: MedlinePlus

The pseudo-code of the proposed biclustering algorithm.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC2396181&req=5

Figure 7: The pseudo-code of the proposed biclustering algorithm.

Mentions: An overview of the procedure of our proposed biclustering algorithm is shown in the flow chart provided in Figure 6 while the details are described in the pseudo-code in Figure 7. There are four parameters in our algorithm: noise threshold ε, minimum number of rows Nr, minimum number of columns Nc and maximum bicluster overlap in percentage Po. The parameter ε specifies the noise tolerance as well as the homogeneity in the identified biclusters. On the other hand, Nr and Nc set the lower bounds of the number of rows and columns of the identified biclusters respectively. Po determines the maximum degree of overlap between identified biclusters. More specifically, no overlap exceeding Po percentage in both the row and column dimensions simultaneously is allowed. In this paper, a bicluster with a subset of rows R and a subset of columns C is denoted by (R, C). At the beginning, the first-level difference matrix D1 is calculated for the input expression matrix E as described in line 4 in Figure 7. Supposed that E has size m rows by n columns. There is altogether n(n - 1)/2 different number of permutations so the size of D1 is m × n(n - 1)/2. In order to derive possible biclusters, a simple clustering algorithm can be applied to identify clusters for each column (lines 6–12). Let X = {x1, x2,...,xN} be a set of N expression values. By comparing xi with all values in X, a set of values Si similar to xi can be found as follows,


Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization.

Cheng KO, Law NF, Siu WC, Liew AW - BMC Bioinformatics (2008)

The pseudo-code of the proposed biclustering algorithm.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC2396181&req=5

Figure 7: The pseudo-code of the proposed biclustering algorithm.
Mentions: An overview of the procedure of our proposed biclustering algorithm is shown in the flow chart provided in Figure 6 while the details are described in the pseudo-code in Figure 7. There are four parameters in our algorithm: noise threshold ε, minimum number of rows Nr, minimum number of columns Nc and maximum bicluster overlap in percentage Po. The parameter ε specifies the noise tolerance as well as the homogeneity in the identified biclusters. On the other hand, Nr and Nc set the lower bounds of the number of rows and columns of the identified biclusters respectively. Po determines the maximum degree of overlap between identified biclusters. More specifically, no overlap exceeding Po percentage in both the row and column dimensions simultaneously is allowed. In this paper, a bicluster with a subset of rows R and a subset of columns C is denoted by (R, C). At the beginning, the first-level difference matrix D1 is calculated for the input expression matrix E as described in line 4 in Figure 7. Supposed that E has size m rows by n columns. There is altogether n(n - 1)/2 different number of permutations so the size of D1 is m × n(n - 1)/2. In order to derive possible biclusters, a simple clustering algorithm can be applied to identify clusters for each column (lines 6–12). Let X = {x1, x2,...,xN} be a set of N expression values. By comparing xi with all values in X, a set of values Si similar to xi can be found as follows,

Bottom Line: Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes.Unfortunately, many useful formulations result in NP-complete problems.Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations.

View Article: PubMed Central - HTML - PubMed

Affiliation: School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Queensland, Australia. k.o.cheng@polyu.edu.hk

ABSTRACT

Background: The DNA microarray technology allows the measurement of expression levels of thousands of genes under tens/hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only 1. Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of biclusters called additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.

Results: We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as pCluster algorithm. By relaxing the constraint in homogeneity, the proposed algorithm has polynomial-time complexity in the worst case instead of exponential-time complexity as in the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. Comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.

Conclusion: We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables an optimum parameter determination in the biclustering algorithm so as to achieve the best result. In future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.

Show MeSH
Related in: MedlinePlus