MAID : an effect size based model for microarray data integration across laboratories and platforms.

Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, McGilvray ID - BMC Bioinformatics (2008)

Bottom Line: Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.


Affiliation: Banting and Best Department of Medical Research, University of Toronto, 112 College St, Toronto, ON M5G1L6, Canada. ivan.borozan@utoronto.ca

ABSTRACT

Background: Gene expression profiling has the potential to unravel molecular mechanisms behind gene regulation and identify gene targets for therapeutic interventions. As microarray technology matures, the number of microarray studies has increased, resulting in many different datasets available for any given disease. The sensitivity and reliability of measurements of gene expression changes can be improved through a systematic integration of different microarray datasets that address the same or similar biological questions.

Results: Traditional effect size models cannot be used to integrate array data that directly compare treatment to control samples expressed as log ratios of gene expression. Here we extend the traditional effect size model to integrate as many array datasets as possible. The extended effect size model (MAID) can integrate any array data type generated with either single- or two-channel arrays using either direct or indirect designs across different laboratories and platforms. The model uses two standardized indices: the standard effect size score for experiments with two groups of data, and a new standardized index that measures the difference in gene expression between treatment and control groups for one-sample data with replicate arrays. The statistical significance of the treatment effect across studies for each gene is determined by appropriate permutation methods depending on the type of data integrated. We apply our method to three different expression datasets from two different laboratories generated using three different array platforms and two different experimental designs. Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated.

Conclusion: High-throughput genomics data provide a rich and complex source of information that could play a key role in deciphering intricate molecular networks behind disease. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.


Figure 5: Meta-analysis false discovery rate. The number of genes vs their significance for individual studies and for the integrated study.

Mentions: To test for homogeneity between datasets, we used the Cochran Q statistic given in eq. 10 (see the Methods section). The results of the test are shown in Figure 4. The observed Q values from the three experiments deviate significantly from the expected quantiles of the distribution, suggesting that the three datasets are heterogeneous. Heterogeneity indicates significant variability between studies, which requires a random effect model (REM) to be fitted. When applied to our pre-processed datasets, the REM found a set of 451 significant genes with FDR ≤ 0.05. To assess the advantage of integrating these three datasets, we first determined the number of genes that had an FDR ≤ 0.05 in the meta-analysis study but for which the FDR in all three individual studies was higher than the FDR in the meta-analysis study. Of the 451 genes in the meta-analysis study, we found 237 that satisfy this criterion. We designated these genes as integration-driven discovery (IDD) genes, as first introduced in [15]. Figure 5 shows a plot of the gene number versus FDR (FDR ≤ 0.05) for each independent dataset and for the meta-analysis, and demonstrates that the largest number of significant genes is observed in the meta-analysis. This increase in the number of significant genes is an indication of the potential benefit of integrating these three datasets using our model.
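The heterogeneity test, the random effect fit and the IDD criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: eq. 10 is not reproduced in this excerpt, so the code assumes the textbook form of Cochran's Q, and it assumes a DerSimonian-Laird estimate of the between-study variance, which the text does not confirm. The function names (cochran_q, random_effects, idd_genes) and the input layout are hypothetical, and the permutation step used in the paper to obtain FDR values is not shown.

```python
import math

def cochran_q(effects, variances):
    """Cochran's Q for one gene (textbook form, assumed to match eq. 10).

    effects   -- per-study effect sizes for the gene
    variances -- per-study variances of those effect sizes
    """
    weights = [1.0 / v for v in variances]
    # Fixed-effect (inverse-variance weighted) mean of the effect sizes.
    mean_fixed = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - mean_fixed) ** 2 for w, d in zip(weights, effects))

def random_effects(effects, variances):
    """Random effect model for one gene.

    Uses the DerSimonian-Laird estimate of the between-study
    variance tau^2 (an assumption, not stated in the source).
    Returns (pooled effect, its standard error, tau^2).
    """
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    q = cochran_q(effects, variances)
    denom = sum_w - sum(w * w for w in weights) / sum_w
    # tau^2 is truncated at zero when Q is below its expectation (k - 1).
    tau2 = max(0.0, (q - (k - 1)) / denom) if denom > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(w * d for w, d in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mu, se, tau2

def idd_genes(meta_fdr, study_fdrs, threshold=0.05):
    """Integration-driven discovery genes.

    meta_fdr   -- dict gene -> FDR in the meta-analysis
    study_fdrs -- list of dicts gene -> FDR, one per individual study
    A gene qualifies if its meta-analysis FDR is at most `threshold`
    and its FDR in every individual study is higher than that value.
    """
    return [
        gene for gene, fdr in meta_fdr.items()
        if fdr <= threshold and all(s[gene] > fdr for s in study_fdrs)
    ]

# Toy numbers for illustration only; they are not data from the paper.
mu, se, tau2 = random_effects([0.0, 2.0], [1.0, 1.0])
meta = {"g1": 0.01, "g2": 0.04, "g3": 0.20}
studies = [
    {"g1": 0.30, "g2": 0.02, "g3": 0.50},
    {"g1": 0.10, "g2": 0.50, "g3": 0.60},
]
print(mu, se, tau2, idd_genes(meta, studies))
```

In this toy example only g1 meets the IDD criterion: g2 is already more significant in one individual study than in the meta-analysis, and g3 does not pass the FDR threshold.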

