Integrative analysis of multiple gene expression profiles with quality-adjusted effect size models.

Hu P, Greenwood CM, Beyene J - BMC Bioinformatics (2005)

Bottom Line: We extended traditional effect size models to combine information from different microarray datasets by incorporating a quality measure for each gene in each study into the effect size estimation. We live in a high-throughput era where technologies constantly change, leaving behind a trail of data with different forms, shapes and sizes. Statistical and computational methodologies are therefore critical for extracting the most out of these related but not identical sources of data.


Affiliation: The Hospital for Sick Children Research Institute, 555 University Ave, Toronto, ON, M5G 1X8, Canada. phu@sickkids.ca

ABSTRACT

Background: With the explosion of microarray studies, an enormous amount of data is being produced. Systematic integration of gene expression data from different sources increases statistical power of detecting differentially expressed genes and allows assessment of heterogeneity. The challenge, however, is in designing and implementing efficient analytic methodologies for combination of data generated by different research groups.

Results: We extended traditional effect size models to combine information from different microarray datasets by incorporating a quality measure for each gene in each study into the effect size estimation. We illustrated our method by integrating two datasets generated using different Affymetrix oligonucleotide types. Our results indicate that the proposed quality-adjusted weighting strategy for modelling inter-study variation of gene expression profiles not only increases consistency and decreases heterogeneous results between these two datasets, but also identifies many more differentially expressed genes than methods proposed previously.

Conclusion: Data integration and synthesis is becoming increasingly important. We live in a high-throughput era where technologies constantly change leaving behind a trail of data with different forms, shapes and sizes. Statistical and computational methodologies are therefore critical for extracting the most out of these related but not identical sources of data.
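The abstract does not spell out the estimator, so the following is only a minimal sketch of one common way to fold a quality weight into a fixed-effect combination of effect sizes: each study's inverse-variance weight is multiplied by its quality weight. The function name, the argument names, and the assumption that quality weights lie in [0, 1] are illustrative and are not taken from the paper.

```python
from math import sqrt


def quality_weighted_effect(effects, variances, qualities):
    """Combine per-study effect sizes for a single gene.

    effects[i]   : effect size estimate d_i from study i
    variances[i] : sampling variance v_i of d_i (must be > 0)
    qualities[i] : quality weight q_i for this gene in study i
                   (assumed here to lie in [0, 1])

    Returns (combined effect, its variance, z statistic).
    """
    if not (len(effects) == len(variances) == len(qualities)):
        raise ValueError("inputs must have the same length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Quality-adjusted weight: q_i * (1 / v_i).
    weights = [q / v for q, v in zip(qualities, variances)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one study needs a positive quality weight")

    # Weighted mean of the per-study effect sizes.
    mu = sum(w * d for w, d in zip(weights, effects)) / total

    # Var(mu) = sum((q_i / v_i)^2 * v_i) / total^2 = sum(q_i^2 / v_i) / total^2.
    var_mu = sum(q * q / v for q, v in zip(qualities, variances)) / total ** 2

    return mu, var_mu, mu / sqrt(var_mu)


if __name__ == "__main__":
    # Two studies, the second with a lower quality weight for this gene.
    print(quality_weighted_effect([1.0, 3.0], [1.0, 1.0], [1.0, 0.5]))
```

When every study has the same quality weight, this reduces to the ordinary inverse-variance estimate; a study with a quality weight of zero drops out of the combination for that gene.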


Figure 1: Scatter plots of (a) quality-weighted and (b) quality-unweighted mean expression values for the 6124 common probe sets in the Harvard and Michigan datasets based on normal (healthy) samples.

Mentions: We used two datasets consisting of gene expression profiles in lung cancer and normal subjects. These datasets were collected using different chip types of the Affymetrix oligonucleotide microarrays and were generated by two research groups, one from Harvard and the other from Michigan (see Methods section for details). A list of 6124 common probe sets found in the two datasets was used for data analysis in this study [10]. We developed a quality weight for each gene in each study by modeling the log of the detection p-values with an exponential distribution, and then summarizing across arrays and groups within each study (see Methods). In order to visualize the effect of the proposed quality weighting, we calculated the mean expression value of each probe set across all normal samples, where the expression variation presumably is less heterogeneous than among the cancer samples. The mean expression value of a probe set in a study is the estimated average of the probe set's intensity values across all normal samples in the study. Figure 1 shows the scatter plot of the average expression values of the probe sets in the Harvard dataset plotted against those of the Michigan dataset: (a) weighted by the quality score (see Methods section for our definition of the quality score), and (b) un-weighted. This plot is intended to be illustrative only: our algorithm weights the test statistics, rather than the gene expression measures. In Figure 1(a), it can be seen, as expected, that many of the genes with low levels of expression are associated with low quality weights.
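The Methods section is not reproduced on this page, so the actual quality-weight definition is not shown here. As a rough illustration of the first step only (fitting an exponential distribution to the log detection p-values), the sketch below assumes the fit is made to x = -ln(p) by maximum likelihood. The function name and the `floor` parameter are hypothetical, and the step that turns the fitted value into a quality weight is deliberately left out because the text above does not specify it.

```python
from math import log


def exponential_rate_from_pvalues(p_values, floor=1e-300):
    """Maximum-likelihood rate of an exponential fitted to x_i = -ln(p_i).

    p_values : detection p-values for one probe set across arrays
    floor    : hypothetical lower clamp so that p = 0 does not give log(0)

    For an exponential distribution the MLE of the rate is n / sum(x_i).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")

    # Transform each p-value to the non-negative scale x = -ln(p).
    xs = [-log(max(p, floor)) for p in p_values]
    total = sum(xs)
    if total <= 0.0:
        raise ValueError("rate is undefined when every p-value equals 1")

    return len(xs) / total


if __name__ == "__main__":
    # Hypothetical detection p-values for one probe set on four arrays.
    print(exponential_rate_from_pvalues([0.001, 0.004, 0.002, 0.010]))
```

If the p-values were uniform on (0, 1), -ln(p) would follow an exponential distribution with rate 1, so a fitted rate well below 1 corresponds to consistently small detection p-values across arrays.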

