SPARCoC: a new framework for molecular pattern discovery and cancer gene identification.

Ma S, Johnson D, Ashby C, Xiong D, Cramer CL, Moore JH, Zhang S, Huang X - PLoS ONE (2015)

Bottom Line: Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. SPARCoC has clear advantages compared with widely-used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping.


Affiliation: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T. Hong Kong.

ABSTRACT
It is challenging to cluster cancer patients of a certain histopathological type into molecular subtypes of clinical importance and identify gene signatures directly relevant to the subtypes. Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. In this paper we present a new framework: SPARCoC (Sparse-CoClust), which is based on a novel Common-background and Sparse-foreground Decomposition (CSD) model and the Maximum Block Improvement (MBI) co-clustering technique. SPARCoC has clear advantages compared with widely-used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping. For testing and verification, we use high quality gene expression profiling data of lung ADCA patients, and identify prognostic gene signatures which could cluster patients into subgroups that are significantly different in their overall survival (with p-values < 0.05). Our results are only based on gene expression profiling data analysis, without incorporating any other feature selection or clinical information; we are able to replicate our findings with completely independent datasets. SPARCoC is broadly applicable to large-scale genomic data to empower pattern discovery and cancer gene identification.


pone.0117135.g004: Consistent TM clusters and consistent HM clusters. (a) TM clusters and (b) HM clusters: Kaplan-Meier plots of the two consistent TM clusters and the two consistent HM clusters from our clustering approach with CSD (decomposition noise level 20,000) and MBI (10 runs with parameter k2 = 2). (c) Sample classification of TM samples using the 128 genes, based on the subtypes of HM, and (d) sample classification of HM samples using the 128 genes, based on the subtypes of TM. TM and HM cross-validation (CV) uses 128 genes, which are identified from all genes based on the clusters from 10 MBI runs on TM and 10 MBI runs on HM; for cross-validation, least-squares-based sample prediction is applied.
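The least-squares-based sample prediction named in the caption is not spelled out in this excerpt. As a minimal sketch, assuming a linear least-squares classifier with an intercept and a +1/-1 coding of the two cluster labels, cross-dataset prediction and leave-one-out evaluation could look like the following. The function names and the synthetic data are hypothetical and are not the authors' code.

```python
import numpy as np


def fit_least_squares(X, labels):
    """Fit a linear least-squares classifier (assumed form, not from the paper).

    X: array of shape (n_samples, n_genes); labels: 0/1 cluster labels.
    """
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)  # code clusters as -1/+1
    A = np.hstack([X, np.ones((X.shape[0], 1))])      # append an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)         # minimum-norm LS solution
    return w


def predict_least_squares(w, X):
    """Assign each sample to cluster 1 if its fitted score is positive, else 0."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return (A @ w > 0).astype(int)


def loocv_accuracy(X, labels):
    """Leave-one-out cross-validation: refit without sample i, then predict it."""
    labels = np.asarray(labels)
    n = X.shape[0]
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i
        w = fit_least_squares(X[keep], labels[keep])
        hits += int(predict_least_squares(w, X[i:i + 1])[0] == labels[i])
    return hits / n


def make_toy_dataset(seed, n_per_cluster=10, n_genes=5, shift=5.0):
    """Hypothetical stand-in for an expression matrix with two clusters."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(0.0, 1.0, size=(n_per_cluster, n_genes))
    X1 = rng.normal(shift, 1.0, size=(n_per_cluster, n_genes))
    X = np.vstack([X0, X1])
    labels = np.array([0] * n_per_cluster + [1] * n_per_cluster)
    return X, labels


if __name__ == "__main__":
    X_a, y_a = make_toy_dataset(seed=0)   # plays the role of one dataset
    X_b, y_b = make_toy_dataset(seed=1)   # plays the role of the other dataset
    w = fit_least_squares(X_a, y_a)       # train on A, classify B
    print("cross-dataset accuracy:", np.mean(predict_least_squares(w, X_b) == y_b))
    print("LOOCV accuracy on A:", loocv_accuracy(X_a, y_a))
```

In this sketch the model is trained on one dataset's cluster labels and applied to the other, mirroring panels (c) and (d); the gene-selection step that yields the 128 genes is not reproduced here.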

Mentions: The TM and HM datasets were used as the training datasets for our analysis. We first applied our MBI clustering approach to the decomposed TM and HM data matrices from CSD and obtained consistent clusters for the TM and HM samples, respectively. Kaplan-Meier plots showed statistically significant differences in overall survival (OS) between the two clusters of patients in each dataset (log-rank test: p = 0.00323 for TM and p = 0.0106 for HM; Fig. 4A and Fig. 4B). The leave-one-out cross-validation (LOOCV) results for the two clusters of TM and HM are 0.96 and 0.80, respectively.
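For readers unfamiliar with the log-rank test used to compare the two survival curves, here is a minimal sketch of the standard two-group statistic, written with NumPy only. The function name `logrank_test` and the toy follow-up times are hypothetical; the paper's own software and data are not shown in this excerpt.

```python
import numpy as np
from math import erfc, sqrt


def logrank_test(time, event, group):
    """Two-group log-rank test.

    time:  follow-up time per patient
    event: 1 if death was observed, 0 if censored
    group: 0/1 cluster label per patient
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group)

    observed = expected = variance = 0.0
    for t in np.unique(time[event]):           # loop over distinct event times
        at_risk = time >= t
        n = at_risk.sum()                      # patients at risk just before t
        n1 = (at_risk & (group == 1)).sum()    # ... of which in group 1
        died = event & (time == t)
        d = died.sum()                         # deaths at t
        d1 = (died & (group == 1)).sum()       # ... of which in group 1
        observed += d1
        expected += d * n1 / n                 # hypergeometric mean
        if n > 1:                              # hypergeometric variance
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0                        # no information to separate groups
    chi2 = (observed - expected) ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))             # upper tail of chi-square, 1 df
    return float(chi2), float(p_value)


if __name__ == "__main__":
    # Toy data: group 0 dies early, group 1 dies late, no censoring.
    times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    events = [1] * 10
    groups = [0] * 5 + [1] * 5
    print(logrank_test(times, events, groups))
```

A small p-value, such as those reported for TM and HM, indicates that the observed number of deaths in one cluster departs from what would be expected if both clusters shared the same survival distribution.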

