MAID : an effect size based model for microarray data integration across laboratories and platforms.

Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, McGilvray ID - BMC Bioinformatics (2008)

Bottom Line: Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.


Affiliation: Banting and Best Department of Medical Research, University of Toronto, 112 College St, Toronto, ON M5G1L6, Canada. ivan.borozan@utoronto.ca

ABSTRACT

Background: Gene expression profiling has the potential to unravel molecular mechanisms behind gene regulation and identify gene targets for therapeutic interventions. As microarray technology matures, the number of microarray studies has increased, resulting in many different datasets available for any given disease. The sensitivity and reliability of measurements of gene expression changes can be improved through a systematic integration of different microarray datasets that address the same or similar biological questions.

Results: Traditional effect size models cannot be used to integrate array data that directly compare treatment to control samples expressed as log ratios of gene expression. Here we extend the traditional effect size model to integrate as many array datasets as possible. The extended effect size model (MAID) can integrate any array data type generated with either single- or two-channel arrays using either direct or indirect designs across different laboratories and platforms. The model uses two standardized indices: the standard effect size score for experiments with two groups of data, and a new standardized index that measures the difference in gene expression between treatment and control groups for one-sample data with replicate arrays. The statistical significance of the treatment effect across studies for each gene is determined by appropriate permutation methods depending on the type of data integrated. We apply our method to three different expression datasets from two different laboratories generated using three different array platforms and two different experimental designs. Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated.

Conclusion: High-throughput genomics data provide a rich and complex source of information that could play a key role in deciphering intricate molecular networks behind disease. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.
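
The two kinds of standardized index described in the abstract can be illustrated with a minimal Python sketch. It is not the MAID implementation: the two-group score below is a generic pooled-SD standardized mean difference, the one-sample index (mean log ratio divided by its standard deviation) is only an assumed stand-in for the paper's new index, and the label-shuffling test is one common permutation scheme for two-group data. All function names are hypothetical.

```python
import math
import random
import statistics


def two_group_effect_size(treatment, control):
    """Standardized mean difference with a pooled SD (generic form, assumed here)."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * statistics.variance(treatment)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_t + n_c - 2)
    if pooled_var == 0:
        return 0.0  # no spread in either group: treat the effect as zero
    return (statistics.mean(treatment) - statistics.mean(control)) / math.sqrt(pooled_var)


def one_sample_index(log_ratios):
    """Assumed one-sample index: mean log ratio over replicate arrays divided by its SD."""
    return statistics.mean(log_ratios) / statistics.stdev(log_ratios)


def label_permutation_p_value(treatment, control, n_perm=2000, seed=0):
    """Two-sided p-value for one gene, obtained by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(two_group_effect_size(treatment, control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(two_group_effect_size(pooled[:n_t], pooled[n_t:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

For one-sample log-ratio data there are no group labels to shuffle, so a different permutation scheme would be needed; the abstract only states that the method depends on the type of data integrated.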


Figure 1: Effect size correlations between experiments. Correlation plots of effect sizes between three experiments; 1a) R = 0.13 (Toronto-cDNA vs Washington-cDNA), 1b) R = 0.14 (Toronto-cDNA vs Washington-oligo), and 1c) R = 0.38 (Washington-oligo vs Washington-cDNA).

Mentions: Before data integration was carried out, an exploratory data analysis as proposed in [20] was conducted to determine whether there were any fundamental differences between experiments that would preclude data integration. As shown in Figure 1, low correlation coefficients were observed between the estimated effect sizes of the three studies: R = 0.13 (Toronto-cDNA vs Washington-cDNA), R = 0.14 (Toronto-cDNA vs Washington-oligo), and R = 0.38 (Washington-oligo vs Washington-cDNA). These low correlation coefficients highlight differences between the three experiments. In high-throughput microarray experiments, a common expectation is that the majority of genes in each study will show little or no difference between conditions. Figure 2 shows the distributions of z scores (see Methods section eq.15) in the three experiments, all of which are centered around zero. This finding indicates that most of the genes in each experiment show little or no difference between treatment and control samples. A significant deviation from zero in any of the three datasets, due to some large systematic effect, would indicate fundamental differences between experiments that could not be resolved by statistical means. Thus, even when low correlations between experiments are observed (for example, because a large number of genes have log2 expression values close to zero with random measurement error), this does not automatically imply that small sets of genes with significant effects across experiments cannot be found, or that data integration should not be considered.
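
The exploratory check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of [20]: it assumes that per-gene effect sizes and z scores are already available as equal-length lists in the same gene order, it uses the Pearson correlation, and it summarizes each z-score distribution by its median. The function names are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of per-gene values."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def pairwise_correlations(effect_sizes):
    """Map each pair of study names to the correlation of their effect sizes.

    effect_sizes: dict of study name -> list of per-gene effect sizes,
    with genes in the same order in every study.
    """
    names = sorted(effect_sizes)
    return {
        (a, b): pearson_r(effect_sizes[a], effect_sizes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def z_score_centers(z_scores):
    """Median z score per study; values far from zero would suggest a systematic shift."""
    return {name: statistics.median(values) for name, values in z_scores.items()}
```

With three studies, `pairwise_correlations` returns three values, one per panel of Figure 1. A low correlation on its own would not rule out integration, as the paragraph above argues; the z-score centers are the check for a large systematic effect.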

