The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets - improving meta-analysis and prediction of prognosis.

Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller CJ, Clarke RB - BMC Med Genomics (2008)

Bottom Line: Simple batch mean-centering was found to significantly reduce the level of inter-experimental variation, allowing raw transcript levels to be compared across datasets with confidence. However, this is highly dependent upon the composition of the datasets and patient characteristics. When these are reconciled, raw data can be directly integrated from different gene expression datasets, leading to new biological findings with increased statistical power.


Affiliation: Applied Bioinformatics of Cancer Research Group, Breakthrough Research Unit, Edinburgh Cancer Research Centre, Western General Hospital, Crewe Road South, Edinburgh, EH4 2XR, UK. andrew.sims@ed.ac.uk

ABSTRACT

Background: The number of gene expression studies in the public domain is rapidly increasing, representing a highly valuable resource. However, dataset-specific bias precludes meta-analysis at the raw transcript level, even when the RNA is from comparable sources and has been processed on the same microarray platform using similar protocols. Here, we demonstrate, using Affymetrix data, that much of this bias can be removed, allowing multiple datasets to be legitimately combined for meaningful meta-analyses.

Results: A series of validation datasets comparing breast cancer and normal breast cell lines (MCF7 and MCF10A) was generated to examine the variability between datasets produced using different amounts of starting RNA, alternative protocols, different generations of Affymetrix GeneChip or different scanning hardware. We demonstrate that systematic, multiplicative biases are introduced at the RNA, hybridization and image-capture stages of a microarray experiment. Simple batch mean-centering was found to significantly reduce the level of inter-experimental variation, allowing raw transcript levels to be compared across datasets with confidence. By accounting for dataset-specific bias, we were able to assemble the largest gene expression dataset of primary breast tumours to date (1107 tumours), from six previously published studies. Using this meta-dataset, we demonstrate that combining greater numbers of datasets or tumours leads to a greater overlap in differentially expressed genes and more accurate prognostic predictions. However, this is highly dependent upon the composition of the datasets and patient characteristics.
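
As an illustration of the batch mean-centering step described above, the following is a minimal sketch, not the authors' code. It assumes that expression values are already on a log scale and that each sample carries a batch (dataset) label; the function name `batch_mean_center` and the toy data are hypothetical.

```python
import numpy as np

def batch_mean_center(log_expr, batches):
    """Subtract each gene's per-batch mean from log-scale expression values.

    log_expr: array of shape (n_genes, n_samples), assumed log-transformed.
    batches:  length-n_samples sequence of batch (dataset) labels.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    batches = np.asarray(batches)
    centered = np.empty_like(log_expr)
    for label in np.unique(batches):
        cols = batches == label
        # Per-gene mean within this batch, kept as a column for broadcasting.
        gene_means = log_expr[:, cols].mean(axis=1, keepdims=True)
        centered[:, cols] = log_expr[:, cols] - gene_means
    return centered

# Toy data (hypothetical): two batches measuring the same two genes.
# Batch "B" carries a 4-fold multiplicative bias, i.e. +2 on the log2 scale.
batch_a = np.array([[5.0, 7.0],
                    [1.0, 3.0]])
batch_b = batch_a + 2.0
combined = np.hstack([batch_a, batch_b])
labels = ["A", "A", "B", "B"]

centered = batch_mean_center(combined, labels)
print(centered)
```

In this toy case the two batches become identical after centering. With real data that only holds to the extent that the batches have comparable sample composition, which is the caveat the abstract raises.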

Conclusion: Multiplicative, systematic biases are introduced at many stages of microarray experiments. When these are reconciled, raw data can be directly integrated from different gene expression datasets leading to new biological findings with increased statistical power.
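
One way to see why mean-centering can reconcile a multiplicative bias is the short derivation below. It is a sketch under stated assumptions, not a model taken from the paper: the symbols are introduced here for illustration, with x the observed intensity of gene g in sample s of batch b, t the underlying transcript level, and beta a gene- and batch-specific multiplicative bias that is constant within a batch.

```latex
\begin{align}
  x_{gsb} &= \beta_{gb}\, t_{gs}
    && \text{(assumed multiplicative bias)} \\
  y_{gsb} = \log_2 x_{gsb} &= \log_2 \beta_{gb} + \log_2 t_{gs}
    && \text{(bias becomes additive on the log scale)} \\
  \bar{y}_{gb} = \frac{1}{n_b} \sum_{s \in b} y_{gsb}
    &= \log_2 \beta_{gb} + \frac{1}{n_b} \sum_{s \in b} \log_2 t_{gs}
    && \text{(per-gene batch mean)} \\
  y_{gsb} - \bar{y}_{gb}
    &= \log_2 t_{gs} - \frac{1}{n_b} \sum_{s \in b} \log_2 t_{gs}
    && \text{(bias term cancels)}
\end{align}
```

The centered value no longer contains the bias term, but it does depend on the mean transcript level of the samples in each batch. Batches with different sample composition will therefore not be made fully comparable, consistent with the dependence on dataset composition noted in the Results.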



Figure 4: Combining datasets or tumours and mean-centering significantly increases prognostic prediction. A, Before mean batch-centering. B, After mean batch-centering. The R² statistic (Cox proportional hazards model) is an assessment of the performance of the predictor generated using each combination of training datasets and the remaining test datasets, generated using supervised principal components analysis. Median values are used where a training dataset was used to assess more than one test dataset (up to 5). R² and p-value results for all possible combinations of training datasets and test datasets (1016) are given in the matrix in Additional File 6.

Mentions: A growing body of evidence supports the notion that gene expression profiling of primary breast tumours can be used to stratify patients by subtype and the likelihood of disease progression (reviewed in [22]). The approaches have included both unsupervised (intrinsic gene set [6-8]) and supervised (genes associated with patient follow up [17,21,23]) methods, along with studies of cancer-associated pathways or tumour characteristics [24-27]. In all cases these signatures appear to predict recurrence, despite the lack of overlap in their respective profiles or signatures [28]. In order to establish whether integrating multiple datasets can improve prognostic prediction, the six published datasets (described above) were used individually or in combination as 'training sets' for supervised principal components analysis [29,30]. This method has been shown to produce more accurate predictions than several competing methods on both simulated and real microarray datasets [29,30]. Using the Superpc [30] package for R [31], a Cox proportional hazards model was fitted to each predictor (generated for all combinations of one to five datasets used as the 'training set') and cross-validation curves were plotted to determine the optimum threshold for the predictor of survival as described previously [30]. The first supervised principal component was found to be the most significant in the vast majority of cases, which is consistent with the hypothesis that recurrence is an inherent property of primary tumour gene expression (examples of cross-validation and survival curves are shown in Additional File 5). The remaining dataset(s) were used as a test set for each predictor and an R² statistic was computed to assess the performance. Combining greater numbers of datasets or tumours significantly improves prediction of prognosis based upon gene expression data (Fig. 4).
Mean-centering of the datasets significantly increased the correlation between the supervised principal components and clinical follow up, thereby improving prognostic performance. It is clear that the predictive power of some combinations of training and test sets is more reliable than others. Although only a limited number of patient and tumour characteristics were available (Table 2), it seems that the most accurate predictions are achieved for test datasets whose characteristics are most similar to those of the individual or combined training dataset. R² statistics and p-values (log rank) for all possible combinations of training datasets and test datasets are given in Additional File 6.
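
The training/test design described above, in which every combination of one to five of the six datasets serves as a training set and the remaining dataset(s) form the test set, can be enumerated with a short sketch. The dataset names `D1` to `D6` and the function name are placeholders introduced here. The sketch only lists the splits; it does not fit any model and is not meant to reproduce the 1016 combinations reported in the Figure 4 caption, whose exact construction is not described in this excerpt.

```python
from itertools import combinations

# Placeholder names for the six published datasets (hypothetical labels).
datasets = ["D1", "D2", "D3", "D4", "D5", "D6"]

def train_test_splits(names):
    """Yield (training, test) tuples for every training set of 1..n-1 datasets."""
    for k in range(1, len(names)):
        for train in combinations(names, k):
            # Whatever is not used for training is held out for testing.
            test = tuple(n for n in names if n not in train)
            yield train, test

splits = list(train_test_splits(datasets))
print(len(splits))   # number of distinct training sets
print(splits[0])     # first split: one training dataset, five test datasets
```

Each training set leaves between one and five datasets for testing, which matches the "up to 5" test datasets per training set mentioned in the figure caption.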

