Cancer Bioinformatic Methods to Infer Meaningful Data From Small-Size Cohorts.

Bennani-Baiti N, Bennani-Baiti IM - Cancer Inform (2015)

Bottom Line: Whole-genome analyses have uncovered that most cancer-relevant genes cluster into 12 signaling pathways. Publicly available genomic data sets constitute a wealth of gene mining opportunities for hypothesis generation and testing. However, the increasingly recognized genetic and epigenetic inter- and intratumor heterogeneity, combined with the preponderance of small-size cohorts, hampers reliable analysis and discovery.


Affiliation: Division of Hematology, Mayo Clinic, Rochester, MN 55905, USA.

ABSTRACT
Whole-genome analyses have uncovered that most cancer-relevant genes cluster into 12 signaling pathways. Knowledge of the signaling pathways and associated gene signatures not only allows us to understand the mechanisms of oncogenesis inherent to specific cancers but also provides us with drug targets, molecular diagnostic and prognostic factors, as well as biomarkers for patient risk stratification and treatment. Publicly available genomic data sets constitute a wealth of gene mining opportunities for hypothesis generation and testing. However, the increasingly recognized genetic and epigenetic inter- and intratumor heterogeneity, combined with the preponderance of small-size cohorts, hampers reliable analysis and discovery. Here, we review two methods that are used to infer meaningful biological events from small-size data sets and discuss some of their applications and limitations.


Figure 2. Probability density distribution of cancer gene data sets of high-incidence adult cancers. All cancer data sets were retrieved from GEO (query performed on August 15, 2015) and plotted against sample size (x axis). These included (A) breast cancer (number of data sets n = 110), (B) prostate cancer (n = 43), (C) lung cancer (n = 25), and (D) colorectal cancer (n = 37).

It is well known that most childhood cancer cohorts are relatively small. This is not only because these diseases are relatively rare but also because less funding is devoted to research on these neoplasms than on their adult counterparts. For example, the combined NIH budget for all types of pediatric sarcomas is only about 1/15th of the budget allocated to breast cancer alone [29]. But what about the cohort size of gene data sets of more frequent childhood cancers (eg, leukemia, with 88 cases/year/million in 1–4-year-old children [30]) or adult cancers (80–690 cases/year/million in men and 73–724 cases/year/million in women for the top 10 cancer types; based on our estimates taking into account the current US population projection [31] and cancer frequencies in adults in the US population in 2015 [32])? To address this question, we ran a meta-analysis of all cancer gene data sets deposited to date in GEO. As shown in Figure 1, the majority of data sets contained 50 or fewer samples and qualified as small-size cohorts (median ; range: 4–192). As may be expected, the largest data sets were mostly those of cancers with high incidence and research funding, which accounted for seven of the 10 largest data sets (not shown). Interestingly, however, probability density distribution analyses of the individual cancers with the highest incidences and research funding, namely those of the breast (representing 29% of all new annual cases in women [32]; ; range: 4–116), prostate (26% of new cases in men [32]; ; range: 4–171), lung (13%–14% of new cases in both genders [32]; ; range: 4–192), and colorectum (8% of new cases in both genders [32]; ; range: 4–104), show that these too are mostly made up of small-size cohorts (Fig. 2). The problem posed by small-size cohorts therefore affects the majority of data sets in both childhood and adult cancers.
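The kind of cohort-size summary described above (median, range, share of data sets with 50 or fewer samples, and a probability density over sample size) can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the sample sizes are invented for the example and are not GEO data, the function names are hypothetical, and the Gaussian kernel with a Scott's-rule bandwidth is an assumed choice, since the text does not state how the densities were estimated.

```python
import math
from statistics import median, stdev

SMALL_COHORT_MAX = 50  # threshold used in the text: 50 or fewer samples


def summarize_cohort_sizes(sizes):
    """Return median, range, and share of small-size cohorts."""
    n_small = sum(1 for s in sizes if s <= SMALL_COHORT_MAX)
    return {
        "n_datasets": len(sizes),
        "median": median(sizes),
        "range": (min(sizes), max(sizes)),
        "fraction_small": n_small / len(sizes),
    }


def gaussian_kde(sizes, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(sizes)
    if bandwidth is None:
        # Scott's rule of thumb for one-dimensional data (an assumed choice).
        bandwidth = stdev(sizes) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sizes)
        for x in grid
    ]


# Hypothetical sample sizes, invented purely for illustration (not GEO data).
example_sizes = [4, 8, 12, 12, 20, 24, 36, 48, 60, 192]

summary = summarize_cohort_sizes(example_sizes)
grid = list(range(0, 201, 10))  # sample-size axis, as on the x axis of the plots
density = gaussian_kde(example_sizes, grid)
print(summary)
```

With real data, `example_sizes` would be replaced by the per-data-set sample counts retrieved from GEO for one cancer type, and `density` plotted against `grid` to obtain one panel of the figure.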
Larger data sets can of course be found in the collections of The Cancer Genome Atlas of the National Cancer Institute (NCI)/National Human Genome Research Institute (NHGRI) and of the Wellcome Trust Sanger Institute. These are, however, also subject to two overriding limitations. The first relates to the frequently observed tumor heterogeneity mentioned above, from which one can presume that many large-size cohort data sets are essentially heterogeneous collections of varying numbers of relatively homogeneous smaller cohorts. Second, although these initiatives have made significant strides in increasing sample size in high-incidence diseases, they still lag somewhat behind in low-incidence or so-called orphan diseases, to which many cancers belong.

