SPARCoC: a new framework for molecular pattern discovery and cancer gene identification.

Ma S, Johnson D, Ashby C, Xiong D, Cramer CL, Moore JH, Zhang S, Huang X - PLoS ONE (2015)

Bottom Line: Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. SPARCoC has clear advantages compared with widely used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type and a significant challenge for molecular subtyping.


Affiliation: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T. Hong Kong.

ABSTRACT
It is challenging to cluster cancer patients of a certain histopathological type into molecular subtypes of clinical importance and to identify gene signatures directly relevant to the subtypes. Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. In this paper we present a new framework, SPARCoC (Sparse-CoClust), which is based on a novel Common-background and Sparse-foreground Decomposition (CSD) model and the Maximum Block Improvement (MBI) co-clustering technique. SPARCoC has clear advantages compared with widely used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type and a significant challenge for molecular subtyping. For testing and verification, we use high-quality gene expression profiling data of lung ADCA patients and identify prognostic gene signatures that cluster patients into subgroups with significantly different overall survival (p-values < 0.05). Our results are based only on gene expression profiling data analysis, without incorporating any other feature selection or clinical information; we are able to replicate our findings with completely independent datasets. SPARCoC is broadly applicable to large-scale genomic data to empower pattern discovery and cancer gene identification.


pone.0117135.g006: Clustering of stage I samples. (a) and (b): Kaplan-Meier plots of the consistent clustering of ACC stage1 (a) and of Jacob stage1 (b) from our clustering approach. The clusters identified by our clustering approach show statistically significant survival differences. (c) and (d): Comparison of the sample separation based on the 144 identified genes with the separation based on the stage information of the GSE5843 dataset. (c) Independent verification testing of the 144 identified genes on GSE5843: Kaplan-Meier plots of the sample clusters, which show statistically significant survival differences (p-value = 0.0249). The clusters are from MBI run with parameter k2 = 2 on the corresponding rows of the dataset (i.e., using only the part of the Y matrix of GSE5843 that corresponds to the 144 genes). (d) Kaplan-Meier plots of the sample clusters based on the separation of stage IA and IB (p-value = 0.026).
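
The verification step in panel (c), restricting the expression matrix to the rows of the signature genes and then splitting the samples into two clusters, can be illustrated roughly as follows. This is only a sketch under stated assumptions: the helper names `select_rows` and `two_means` are hypothetical, the toy matrix is invented for illustration, and a plain 2-means loop stands in for MBI, whose details are not given in this section.

```python
def select_rows(matrix, row_names, signature):
    """Keep only the rows (genes) of `matrix` whose name is in `signature`."""
    wanted = set(signature)
    return [row for name, row in zip(row_names, matrix) if name in wanted]


def two_means(samples, iterations=20):
    """Plain 2-means on sample vectors. A stand-in for MBI, not MBI itself."""
    # Naive deterministic initialisation: first and last sample as centers.
    centers = [list(samples[0]), list(samples[-1])]
    labels = [0] * len(samples)
    for _ in range(iterations):
        # Assign each sample to the nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2 for x, c in zip(s, centers[k])))
            for s in samples
        ]
        # Recompute each center as the mean of its members.
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:
                centers[k] = [sum(v) / len(members) for v in zip(*members)]
    return labels


# Toy data (assumed): 3 genes x 4 samples, with a 2-gene "signature".
row_names = ["g1", "g2", "g3"]
matrix = [
    [0.1, 0.2, 5.0, 5.1],
    [9.0, 1.0, 9.0, 1.0],
    [0.0, 0.1, 4.9, 5.2],
]
signature = ["g1", "g3"]

sub = select_rows(matrix, row_names, signature)
samples = list(zip(*sub))  # transpose: one vector per sample
labels = two_means(samples)
```

In the setting described by the caption, the signature would be the 144 identified genes and the matrix would be the Y matrix of GSE5843; the resulting two sample groups are what the Kaplan-Meier comparison is drawn from.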

Mentions: For discovery, we used the ACCstage1 and Jacobstage1 datasets as the training datasets. We first applied our clustering approach to each dataset and obtained a consistent clustering for each. Kaplan-Meier plots showed significant differences in overall survival (OS) (p = 0.0164 for ACCstage1 and p = 0.0018 for Jacobstage1 by log-rank test) between the two clusters of patients in each dataset (Fig. 6A and Fig. 6B).
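
For readers unfamiliar with the log-rank test used to compare the two clusters, a minimal two-group version can be sketched as below. It is an illustration under assumptions, not the authors' code: the function name `logrank_test` and the toy survival times are made up, and the p-value uses the standard chi-square approximation with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*  : follow-up times for each patient
    events_* : 1 if death was observed at that time, 0 if censored
    """
    # Pool the samples, tagging each with its group (0 = a, 1 = b).
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in samples if u >= t)              # at risk, both groups
        n_a = sum(1 for u, _, g in samples if u >= t and g == 0)  # at risk, group a
        d = sum(1 for u, e, _ in samples if u == t and e)          # deaths at t
        d_a = sum(1 for u, e, g in samples if u == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-a death count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy example (assumed data): cluster a dies early, cluster b dies late.
chi2, p = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

A small p-value, as reported for ACCstage1 and Jacobstage1, indicates that the observed deaths in one cluster deviate from what would be expected if both clusters shared the same survival curve.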

