SPARCoC: a new framework for molecular pattern discovery and cancer gene identification.

Ma S, Johnson D, Ashby C, Xiong D, Cramer CL, Moore JH, Zhang S, Huang X - PLoS ONE (2015)

Bottom Line: Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. SPARCoC has clear advantages compared with widely used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping.


Affiliation: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T. Hong Kong.

ABSTRACT
It is challenging to cluster cancer patients of a certain histopathological type into molecular subtypes of clinical importance and to identify gene signatures directly relevant to the subtypes. Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. In this paper we present a new framework, SPARCoC (Sparse-CoClust), which is based on a novel Common-background and Sparse-foreground Decomposition (CSD) model and the Maximum Block Improvement (MBI) co-clustering technique. SPARCoC has clear advantages compared with widely used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping. For testing and verification, we use high-quality gene expression profiling data of lung ADCA patients, and identify prognostic gene signatures which could cluster patients into subgroups that are significantly different in their overall survival (with p-values < 0.05). Our results are based only on gene expression profiling data analysis, without incorporating any other feature selection or clinical information; we are able to replicate our findings with completely independent datasets. SPARCoC is broadly applicable to large-scale genomic data to empower pattern discovery and cancer gene identification.


Figure 3 (pone.0117135.g003): Comparison of the clusters from hierarchical clustering (Hclust) versus those from MBI, and of the clusters from NMF versus those from MBI. (a) and (b): Kaplan-Meier survival plots based on the unsupervised clusters from Hclust and from MBI, given the same gene expression matrix M (lung ADCA Canada dataset from Shedden et al. [7]). (a) Kaplan-Meier survival plot based on Hclust. (b) Kaplan-Meier survival plot based on MBI clustering (with leave-one-out cross-validation (LOOCV) accuracy of ~99%). MBI shows a better separation of the aggressive subgroup from the other two subgroups compared with Hclust (Bryant et al. [6]). The LOOCV was done using PAM [18]. (c) and (d): Kaplan-Meier survival plots based on the unsupervised clustering of NMF (c) and of MBI (d), given the same gene expression matrix M (lung ADCA Canada dataset from Shedden et al. [7]). Given the same gene expression testing data, the survival curves from MBI clustering show a more significant separation than those from NMF clustering. All p-values are calculated by log-rank test.

Mentions: Refer to the testing results provided here and in the Supporting Information (see S1 File for additional testing results), which demonstrate the clear advantages of our new clustering framework. Our testing results show that: (1) the CSD approach facilitates the identification of gene markers, making potential gene markers stand out from the “background”; (2) the MBI approach performs better on Y than on M, where M is the original gene expression matrix and Y is the sparse matrix generated through CSD decomposition; (3) our new clustering framework performs much better than the widely used clustering approaches, e.g., Hclust and NMF (see also Fig. 3A and 3B, and Fig. 3C and 3D). The smaller p-values from the log-rank test (Fig. 3; Table 2) and the lower percentages of 3-year overall survival in the high-risk groups (see also S1 File for additional testing results) indicate that our CSD+MBI model is a better clustering model.
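
The comparison above rests on log-rank p-values between survival curves of patient clusters. As a rough illustration of what such a p-value measures, the following is a minimal sketch of a two-group log-rank test in plain Python. It is not the authors' code: the function name `logrank_two_groups` and the toy survival times are hypothetical, and it handles only two groups, whereas Fig. 3 compares three subgroups in panels (a) and (b).

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (illustrative sketch, not the paper's code).

    times_*  : follow-up times for each patient in the group
    events_* : 1 if death was observed at that time, 0 if censored
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    # Pool both groups, tagging each sample with its group (0 = A, 1 = B).
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]

    # Only distinct times at which at least one death occurred contribute.
    event_times = sorted({t for t, e, _ in samples if e == 1})

    observed_minus_expected = 0.0  # sum over times of (deaths in A - expected)
    variance = 0.0                 # sum of hypergeometric variances

    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)                                  # total at risk
        n_a = sum(1 for s in at_risk if s[2] == 0)        # at risk in group A
        deaths = [s for s in at_risk if s[0] == t and s[1] == 1]
        d = len(deaths)                                   # deaths at time t
        d_a = sum(1 for s in deaths if s[2] == 0)         # deaths in group A

        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0

    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (hypothetical): an "aggressive" group with early deaths
# versus a group with later deaths; all events observed.
high_risk = [1, 2, 3, 4, 5]
low_risk = [10, 11, 12, 13, 14]
chi2, p = logrank_two_groups(high_risk, [1] * 5, low_risk, [1] * 5)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A smaller p-value means the observed deaths in one group deviate more from what would be expected if both groups shared the same survival curve, which is the sense in which the text reads smaller p-values as better cluster separation.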

