SPARCoC: a new framework for molecular pattern discovery and cancer gene identification.

Ma S, Johnson D, Ashby C, Xiong D, Cramer CL, Moore JH, Zhang S, Huang X - PLoS ONE (2015)

Bottom Line: Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. SPARCoC has clear advantages compared with widely used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping.


Affiliation: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T. Hong Kong.

ABSTRACT
It is challenging to cluster cancer patients of a certain histopathological type into molecular subtypes of clinical importance and identify gene signatures directly relevant to the subtypes. Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. In this paper we present a new framework: SPARCoC (Sparse-CoClust), which is based on a novel Common-background and Sparse-foreground Decomposition (CSD) model and the Maximum Block Improvement (MBI) co-clustering technique. SPARCoC has clear advantages compared with widely-used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping. For testing and verification, we use high quality gene expression profiling data of lung ADCA patients, and identify prognostic gene signatures which could cluster patients into subgroups that are significantly different in their overall survival (with p-values < 0.05). Our results are only based on gene expression profiling data analysis, without incorporating any other feature selection or clinical information; we are able to replicate our findings with completely independent datasets. SPARCoC is broadly applicable to large-scale genomic data to empower pattern discovery and cancer gene identification.


pone.0117135.g001: Overview of the new clustering framework. (a) An artificial example: given the input gene expression matrix M, where are the “interesting” genes hidden? (That is, which genes are significant for distinguishing the potential molecular subtypes?) The “interesting” genes are not easily detected from the given M matrix using currently popular clustering methods, e.g., NMF or Hclust. However, we can clearly see the “foreground” (a co-cluster of size 5×5, shown in green in the Y matrix) after the distracting “background” X matrix is removed through the decomposition. The “interesting” genes (rows 10–14) are differentially expressed for samples/columns 10–14 of the Y matrix. (b) The new clustering framework. This framework includes two modules: the common-background and sparse-foreground decomposition (CSD) and the Maximum Block Improvement (MBI) co-clustering. Given an M matrix, the CSD module decomposes M and generates a “foreground” Y matrix; the MBI co-clustering module then works on the Y matrix and outputs the co-clusters, which identify groups of samples and the groups of genes associated with each of those sample groups. Our framework clusters by “sparse-foreground” commonality, while many current clustering methods cluster by “background” commonality.
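The artificial example in panel (a) can be imitated with a small numerical sketch. The code below is not the authors' CSD model, whose formulation is not given in this excerpt. It is a deliberately simple stand-in that assumes the background is one baseline value per gene shared by all samples, estimates it with a per-gene median, and soft-thresholds the residual to get a sparse foreground. The matrix size (20×20), the noise level, the threshold `tau`, the 0-based reading of "rows 10–14", and the function names are all assumptions made for illustration.

```python
import numpy as np


def make_example(n_genes=20, n_samples=20, seed=0):
    """Build M = X + Y + noise with a common background X and a sparse 5x5 foreground Y."""
    rng = np.random.default_rng(seed)
    baseline = rng.uniform(1.0, 5.0, size=n_genes)      # one background level per gene
    X = np.tile(baseline[:, None], (1, n_samples))      # identical across all samples
    Y = np.zeros((n_genes, n_samples))
    Y[10:15, 10:15] = 3.0                               # the hidden co-cluster (0-based indices)
    noise = rng.normal(0.0, 0.05, size=(n_genes, n_samples))
    return X + Y + noise, X, Y


def split_background_foreground(M, tau=1.0):
    """Stand-in decomposition: per-gene median as background, soft-thresholded residual as foreground."""
    background = np.median(M, axis=1, keepdims=True)    # robust to a sparse foreground
    residual = M - background
    foreground = np.sign(residual) * np.maximum(np.abs(residual) - tau, 0.0)
    return np.broadcast_to(background, M.shape), foreground


M, X_true, Y_true = make_example()
X_hat, Y_hat = split_background_foreground(M)

# Genes (rows) and samples (columns) that carry any foreground signal.
fg_rows = np.flatnonzero(np.any(Y_hat != 0, axis=1)).tolist()
fg_cols = np.flatnonzero(np.any(Y_hat != 0, axis=0)).tolist()
print("foreground genes:  ", fg_rows)
print("foreground samples:", fg_cols)
```

With these settings the recovered rows and columns are exactly 10 to 14, the planted block. The median works here only because the foreground touches a minority of samples in each gene; real expression data would need the actual CSD model and a co-clustering step such as MBI on the resulting foreground.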

Mentions: One inherent limitation of existing methods is that the common features in the background of large-scale genomic data from cancer patients may obscure the detection of rare but crucial data variations, i.e., the important genomic features that define the fine-grained molecular subtypes of patients. As in image processing, when presented with thousands of surveillance pictures of the same background area, if we could remove the distraction of the common background and focus only on the sparse, interesting foreground information, we could easily and clearly detect the important patterns. Here, we present SPARCoC (Sparse-CoClust), a new unsupervised clustering framework for discovering molecular patterns and cancer molecular subtypes. The framework is based on a scheme known as common-background sparse-foreground decomposition (CSD) and a technique known as Maximum Block Improvement (MBI) checkerboard co-clustering. This new framework appears to have significant advantages in cancer molecular subtyping and gene signature identification. As we will see later in an example (Fig. 1A), clustering by commonality (which is the philosophy behind almost all existing clustering methods) is fundamentally flawed in the context of cancer molecular subtyping. Instead, the ability to detect the abnormality hidden behind the common background is the core feature of our new approach.
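The claim that a common background can hide a subtype from similarity-based clustering can be checked on synthetic data. The sketch below is an illustration under assumed settings, not an experiment from the paper: it plants a 5×5 subtype block in a 20×20 matrix whose per-gene baseline is shared by all samples, then compares how well sample-to-sample Pearson correlation separates the subtype before and after the per-gene median is subtracted. The helper `subtype_gap`, the baseline range, and the noise level are hypothetical choices.

```python
import numpy as np


def subtype_gap(A, subtype_cols):
    """Mean within-subtype correlation minus mean subtype-vs-rest correlation (samples are columns)."""
    C = np.corrcoef(A, rowvar=False)                    # sample-by-sample correlation matrix
    inside = np.zeros(A.shape[1], dtype=bool)
    inside[subtype_cols] = True
    within_block = C[np.ix_(inside, inside)]
    k = int(inside.sum())
    # Average the off-diagonal entries only (the diagonal is self-correlation).
    within = (within_block.sum() - np.trace(within_block)) / (k * (k - 1))
    between = C[np.ix_(inside, ~inside)].mean()
    return within - between


rng = np.random.default_rng(0)
baseline = rng.uniform(0.0, 20.0, size=20)              # strong background shared by all samples
M = baseline[:, None] + rng.normal(0.0, 0.05, size=(20, 20))
M[10:15, 10:15] += 3.0                                  # planted subtype signal

# Remove the per-gene common level (a simple stand-in for background removal).
R = M - np.median(M, axis=1, keepdims=True)

subtype = np.arange(10, 15)
gap_raw = subtype_gap(M, subtype)
gap_residual = subtype_gap(R, subtype)
print(f"separation on raw M:          {gap_raw:.3f}")
print(f"separation after background:  {gap_residual:.3f}")
```

On the raw matrix every pair of samples is highly correlated because the shared baseline dominates, so the gap between subtype and non-subtype samples is small; once the common level is removed, the gap becomes large. This mirrors the surveillance-picture analogy in a toy setting and says nothing about how strong the effect is on real expression data.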

