Application of gene shaving and mixture models to cluster microarray gene expression data.

Do KA, McLachlan GJ, Bean R, Wen S - Cancer Inform (2007)

Bottom Line: Although the two approaches have been developed from different perspectives, the results demonstrate a clear correspondence between gene clusters produced by GeneClust and EMMIX-GENE for the colon tissue data. The intent of the EMMIX-GENE method is to cluster the tissue samples. It performs a filtering step that results in a subset of relevant genes, followed by gene clustering, and then tissue clustering, and is favorable in its accuracy of ranking the clusters produced.

Affiliation: University of Texas, M.D. Anderson Cancer Center, Houston, Texas, USA. kim@mdanderson.org

ABSTRACT
Researchers are frequently faced with the analysis of microarray data of a relatively large number of genes using a small number of tissue samples. We examine the application of two statistical methods for clustering such microarray expression data: EMMIX-GENE and GeneClust. EMMIX-GENE is a mixture-model-based clustering approach, designed primarily to cluster tissue samples on the basis of the genes. GeneClust is an implementation of the gene shaving methodology, motivated by research to identify distinct sets of genes for which variation in expression could be related to a biological property of the tissue samples. We illustrate the use of these two methods in the analysis of Affymetrix oligonucleotide arrays of well-known data sets from colon tissue samples with and without tumors, and of tumor tissue samples from patients with leukemia. Although the two approaches have been developed from different perspectives, the results demonstrate a clear correspondence between gene clusters produced by GeneClust and EMMIX-GENE for the colon tissue data. It is demonstrated, for the case of ribosomal proteins and smooth muscle genes in the colon data set, that both methods can classify genes into co-regulated families. It is further demonstrated that tissue types (tumor and normal) can be separated on the basis of subtle distributed patterns of genes. Application to the leukemia tissue data produces a division of tissues corresponding closely to the external classification, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), for both methods. We also identify genes specific for the subgroup of ALL-Tcell samples.
Overall, we find that the gene shaving method produces gene clusters at great speed; allows variable cluster sizes and can incorporate partial or full supervision; and finds clusters of genes in which the gene expression varies greatly over the tissue samples while maintaining a high level of coherence between the gene expression profiles. The intent of the EMMIX-GENE method is to cluster the tissue samples. It performs a filtering step that results in a subset of relevant genes, followed by gene clustering, and then tissue clustering, and is favorable in its accuracy of ranking the clusters produced.

Analysis of the Alon data set (2,000 genes) under full supervision. Heat maps of the first two gene shaving clusters for the colon data with full supervision; the samples are sorted by the column mean gene.

Mentions: We also reanalyzed the full Alon data set with levels of supervision ranging from 10% to 100%, using the external classification of tumor versus normal. With 50% supervision, the first four gene clusters are presented in Figure 5. The first cluster (samples are not reordered) shows 50 genes (including the two smooth-muscle genes J02854 and T60155) representing two distinct groups of negatively correlated genes that correspond well to the external classification. The third cluster of 5 genes (sorted by the column means of the cluster) groups the tissues according to the old versus new protocols. When 100% supervision is used (Figure 6), the most coherent cluster that corresponds to the external classification consists of 9 genes and classifies the tumors and normals with 6 misclassifications (Rand index of 0.82), as found by other methods. These nine genes also correspond to those with the top TNoM scores used by Ben-Dor et al. (2000). TNoM is the threshold number of misclassifications, which measures the “relevance” of a gene. Inspection of the variance and Gap plots under the fully supervised scenario indicates that only the first cluster captures the full external classification.
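The two summary measures quoted above can be illustrated with a short sketch. It is not the authors' code: the function names `rand_index` and `tnom_score` are hypothetical, and the example assumes the commonly reported composition of the Alon colon data (62 tissues, 40 tumor and 22 normal), which the passage itself does not state. Which six tissues are misclassified is also not given, so the choice below is arbitrary. The Rand index counts the fraction of sample pairs on which two partitions agree. TNoM is sketched as the smallest number of misclassified samples achievable by a single expression threshold on one gene, in either direction.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs treated the same way by both partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two samples")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # pair is grouped consistently
        total += 1
    return agree / total


def tnom_score(expression, labels):
    """Minimum misclassifications over all thresholds and both directions."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("TNoM sketch expects exactly two classes")
    low, high = classes
    pairs = sorted(zip(expression, labels))
    n = len(pairs)
    best = n
    # k is the number of samples falling below the candidate threshold.
    for k in range(n + 1):
        # A threshold cannot separate samples with identical expression.
        if 0 < k < n and pairs[k - 1][0] == pairs[k][0]:
            continue
        errors = sum(1 for _, lab in pairs[:k] if lab != low)
        errors += sum(1 for _, lab in pairs[k:] if lab != high)
        # n - errors is the error count for the opposite direction.
        best = min(best, errors, n - errors)
    return best


if __name__ == "__main__":
    # Assumed class sizes for the Alon data: 40 tumor, 22 normal.
    truth = ["tumor"] * 40 + ["normal"] * 22
    predicted = list(truth)
    # Arbitrary (hypothetical) choice of six misclassified tissues.
    for idx in (0, 1, 2):
        predicted[idx] = "normal"
    for idx in (59, 60, 61):
        predicted[idx] = "tumor"
    print(round(rand_index(truth, predicted), 2))  # 0.82

    # Toy gene: one normal sample sits among the tumor values.
    print(tnom_score([1, 2, 3, 4, 5, 6], ["N", "N", "T", "N", "T", "T"]))  # 1
```

Under these assumptions, 6 misclassified tissues out of 62 give 6 × 56 = 336 discordant pairs out of 1,891, that is, a Rand index of about 0.82, which matches the value quoted in the text.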

