Integrative analysis of survival-associated gene sets in breast cancer.

Varn FS, Ung MH, Lou SK, Cheng C - BMC Med Genomics (2015)

Bottom Line: Using the GSAS metric, we identified 120 gene sets that were significantly associated with patient survival in all datasets tested. Most interestingly, removal of the genes in this gene set from the gene pool on MSigDB resulted in a large reduction in the number of predictive gene sets, suggesting a prominent role for these genes in breast cancer progression. We used this metric to identify predictive gene sets and to construct a novel gene set containing genes heavily involved in cancer progression.


Affiliation: Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, 03755, USA. Frederick.S.Varn.Jr.GR@dartmouth.edu.

ABSTRACT

Background: Patient gene expression information has recently become a clinical feature used to evaluate breast cancer prognosis. The emergence of prognostic gene sets that take advantage of these data has led to a rich library of information that can be used to characterize the molecular nature of a patient's cancer. Identifying robust gene sets that are consistently predictive of a patient's clinical outcome has become one of the main challenges in the field.

Methods: We supplied patient gene expression data and gene sets from MSigDB to our previously established BASE algorithm to develop the gene set activity score (GSAS), a metric that quantitatively assesses a gene set's activity level in a given patient. We used this metric, along with patient time-to-event data, to perform survival analyses that identified the gene sets significantly correlated with patient survival. We then performed cross-dataset analyses to identify robust prognostic gene sets and to classify patients by metastasis status. Additionally, we created a gene set network based on component gene overlap to explore the relationships between gene sets derived from MSigDB. We developed a novel gene set based on this network's topology and applied the GSAS metric to characterize its role in patient survival.

Results: Using the GSAS metric, we identified 120 gene sets that were significantly associated with patient survival in all datasets tested. The gene overlap network analysis yielded a novel gene set enriched in genes shared by the robustly predictive gene sets. This gene set was highly correlated with patient survival when used alone. Most interestingly, removal of the genes in this gene set from the gene pool on MSigDB resulted in a large reduction in the number of predictive gene sets, suggesting a prominent role for these genes in breast cancer progression.

Conclusions: The GSAS metric provided a useful medium by which we systematically investigated how gene sets from MSigDB relate to breast cancer patient survival. We used this metric to identify predictive gene sets and to construct a novel gene set containing genes heavily involved in cancer progression.
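
The Methods describe a gene set network built from component gene overlap, but this abstract does not say how overlap was scored. The following is a minimal sketch, assuming the Jaccard index as the overlap measure and a hypothetical cutoff (`min_overlap`) for drawing an edge. The gene set names and gene symbols are toy placeholders, not MSigDB content.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared genes divided by all genes in either set (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_edges(
    gene_sets: Dict[str, Set[str]], min_overlap: float = 0.25
) -> List[Tuple[str, str, float]]:
    """Return weighted edges between gene sets whose overlap meets the cutoff.

    `min_overlap` is a hypothetical threshold chosen for illustration only.
    """
    edges = []
    for (name1, genes1), (name2, genes2) in combinations(sorted(gene_sets.items()), 2):
        weight = jaccard(genes1, genes2)
        if weight >= min_overlap:
            edges.append((name1, name2, weight))
    return edges


# Toy gene sets (placeholders, not taken from MSigDB).
toy_sets = {
    "SET_A": {"G1", "G2", "G3", "G4"},
    "SET_B": {"G3", "G4", "G5"},
    "SET_C": {"G9"},
}
edges = overlap_edges(toy_sets)
for name1, name2, weight in edges:
    print(f"{name1} -- {name2}: {weight:.2f}")
```

In a network of this kind, genes that recur across many connected gene sets are natural candidates for a derived gene set, which is the general idea the abstract attributes to the network's topology.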


Figure 3: Expansion of analysis to 7 datasets yields 120 robust signatures. (A) Overlap analysis of gene sets significantly correlated with survival across seven datasets (p < 0.05) reveals 120 “robust” gene sets. (B) A gene signature containing genes that are differentially expressed between pediatric tumors and normal tissue. (C) A gene signature whose genes are associated with periodic cell cycle expression. (D) A gene signature whose genes are associated with histologic grade. (E) A gene signature containing genes stimulated by EGF in HeLa cells. (B-E) are examples of the 120 robust gene sets. In (B) and (C), high signature activity has a protective effect (hr < 1), while in (D) and (E) high signature activity has a deleterious effect (hr > 1).

Mentions: Gene sets whose predictive value is limited to only a few datasets are of little clinical use, as the association between patient survival and gene set activity may be due to confounding factors in the datasets or a small overall patient sample size. Gene sets with consistent prognostic performance across many datasets are less likely to have these problems, as they have been tested in more diverse settings with a greater overall patient sample size. To identify such gene sets, we first extended the analysis performed in the van de Vijver et al. dataset (see above) to four more datasets [8,10,28,29] and two sub-datasets derived from a meta-analysis study [26], for a total of seven independent analyses. Each analysis yielded a list of gene sets ranked by their fit in a Cox proportional hazards model based on the given dataset’s patient survival information. We then identified gene sets that remained predictive of survival across all seven datasets by selecting those that had a Cox proportional hazards p-value < 0.05 in every analysis. This yielded a list of 120 gene sets that were significantly associated with patient survival across every dataset we tested (Figure 3A, Additional file 1). Examples of some of these robustly predictive gene sets are shown in Figures 3B-E as survival curves generated from the van de Vijver dataset. The two gene sets in Figures 3B and C have a hazard ratio greater than 1 (hr = 1.33 and 1.26, respectively). In these gene sets, higher activity is associated with decreased survival, indicating that these gene sets are representative of a deleterious expression signature. The opposite is shown in the gene sets represented in Figures 3D and E, in which both gene sets have a hazard ratio less than 1 (hr = 0.75 and 0.83, respectively). For these gene sets, increased gene set activity is associated with increased survival time. This suggests that these gene sets are indicative of a gene expression profile exhibited by patients having a favorable prognosis.
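
The selection rule described above (a Cox p-value < 0.05 in every one of the seven analyses) amounts to an all-datasets filter, and the sign of the effect can be read from the hazard ratio. The sketch below illustrates both steps under stated assumptions: the gene set names and p-values are invented toy data, the function names are hypothetical, and the Cox model fitting itself is not shown.

```python
from typing import Dict, List

ALPHA = 0.05  # significance cutoff stated in the text


def robust_gene_sets(pvalues: Dict[str, List[float]], alpha: float = ALPHA) -> List[str]:
    """Return gene sets whose p-value is below alpha in every dataset."""
    return sorted(
        name for name, ps in pvalues.items() if ps and all(p < alpha for p in ps)
    )


def direction(hazard_ratio: float) -> str:
    """Label the effect of high gene set activity from its hazard ratio."""
    if hazard_ratio > 1:
        return "deleterious"  # higher activity, higher hazard
    if hazard_ratio < 1:
        return "protective"  # higher activity, lower hazard
    return "neutral"


# Toy p-values, one per dataset (seven analyses); not results from the study.
toy_pvalues = {
    "SET_A": [0.001, 0.01, 0.02, 0.03, 0.004, 0.04, 0.049],
    "SET_B": [0.001, 0.01, 0.02, 0.03, 0.004, 0.04, 0.20],  # fails in one dataset
    "SET_C": [0.03] * 7,
}
robust = robust_gene_sets(toy_pvalues)
print(robust)
print(direction(1.33), direction(0.75))
```

Requiring significance in every dataset is a strict criterion: a gene set that misses the cutoff in a single analysis, like `SET_B` here, is excluded regardless of how strongly it performs elsewhere.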


