A first principles approach to differential expression in microarray data analysis.

Rubin RA - BMC Bioinformatics (2009)

Bottom Line: Here we take the approach of making the fewest assumptions about the structure of the microarray data. We applied the technique to the HGU-133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results.


Affiliation: Mathematics Department, Whittier College, 13406 E. Philadelphia St., Whittier, CA 90608, USA. brubin698@earthlink.net

ABSTRACT

Background: The disparate results from the methods commonly used to determine differential expression in Affymetrix microarray experiments may well result from the wide variety of probe set and probe level models employed. Here we take the approach of making the fewest assumptions about the structure of the microarray data. Specifically, we only require that, under the hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set: a) the probe amplitudes are independent and identically distributed over the conditions, and b) the distributions of the replicated probe amplitudes are amenable to classical analysis of variance (ANOVA). Log-amplitudes that have been standardized within-chip meet these conditions well enough for our approach, which is to perform ANOVA across conditions for each probe position, and then take the median of the resulting (1 - p) values as a gene-level measure of differential expression.

Results: We applied the technique to the HGU-133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results. This procedure is quite sensitive, so much so that it has revealed the presence of probe sets that might properly be called "unanticipated positives" rather than "false positives", because plots of these probe sets strongly suggest that they are differentially expressed.

Conclusion: The median ANOVA (1-p) approach presented here is a very simple methodology that does not depend on any specific probe level or probe models, and does not require any pre-processing other than within-chip standardization of probe level log amplitudes. Its performance is comparable to other published methods on the standard spike-in data sets, and it has revealed the presence of new categories of probe sets that might properly be referred to as "unanticipated positives" and "unanticipated negatives" that need to be taken into account when using spiked-in data sets as "truthed" test beds.
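
The procedure summarized in the abstract (standardize log-amplitudes within each chip, run an ANOVA across conditions for each probe position, take the median of the resulting (1 - p) values) can be illustrated with a minimal Python sketch. It is not the author's code. The z-score form of the standardization, the function names `standardize_within_chip` and `median_anova_score`, the use of `scipy.stats.f_oneway` for the one-way ANOVA, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import f_oneway


def standardize_within_chip(log_amplitudes):
    """Z-score each chip (column) separately.

    Assumption: "within-chip standardization" is taken here to mean
    subtracting the chip mean and dividing by the chip standard deviation.
    """
    mean = log_amplitudes.mean(axis=0, keepdims=True)
    sd = log_amplitudes.std(axis=0, ddof=1, keepdims=True)
    return (log_amplitudes - mean) / sd


def median_anova_score(probe_set, condition_labels):
    """Median over probe positions of (1 - p) from a one-way ANOVA.

    probe_set: array of shape (n_probe_positions, n_chips) holding
        standardized log-amplitudes for one gene's probe set.
    condition_labels: length n_chips, the condition of each chip.
    """
    labels = np.asarray(condition_labels)
    conditions = np.unique(labels)
    one_minus_p = []
    for probe_row in probe_set:
        # Split this probe position's replicates by condition.
        groups = [probe_row[labels == c] for c in conditions]
        _, p_value = f_oneway(*groups)
        one_minus_p.append(1.0 - p_value)
    return float(np.median(one_minus_p))


# Synthetic illustration (hypothetical data, not a spike-in data set):
# 6 chips, 2 conditions with 3 replicates each, 11 probes per probe set.
rng = np.random.default_rng(0)
log_amplitudes = rng.normal(size=(2000, 6))
log_amplitudes[:11, 3:] += 4.0  # one "spiked-in" probe set under condition B

labels = ["A", "A", "A", "B", "B", "B"]
standardized = standardize_within_chip(log_amplitudes)

spiked_score = median_anova_score(standardized[:11], labels)
null_score = median_anova_score(standardized[11:22], labels)
print(f"spiked-in probe set: {spiked_score:.3f}")
print(f"unchanged probe set: {null_score:.3f}")
```

A score near 1 indicates that most probe positions individually show a difference across conditions. Taking the median rather than the mean keeps a few aberrant probes from dominating the gene-level measure.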



Figure 6: A comparison of p-values obtained from median ANOVA (1-p) and PLM processing of a chosen pair of HGU-133A test conditions. This chart shows that, based on their unadjusted p-values, median ANOVA (1-p) and probe level modeling (PLM) are in fair agreement for unexpressed genes and genes with a factor of two difference in initial concentrations. The tendency for median ANOVA (1-p) to produce, on average, somewhat smaller p-values for spiked-in genes and somewhat larger p-values for non-expressed genes may be due to inaccuracies in chip construction and/or probe level models. There is an even larger difference in p-values for genes whose concentration changes from 512 to 0 pM than for the RMA comparison, but again that has no impact on whether the genes are declared differentially expressed or not.

Mentions: Figures 5 and 6 show the relationships between the unadjusted p-values obtained from our median ANOVA (1-p) methodology and those obtained from RMA and probe level modeling (PLM) processing of Experimental Conditions 1 and 2 of the HGU-133A Spike-in Experiment. (We chose those conditions as an example because, as mentioned above, there is a considerably expanded set of highly differentially expressed genes involved in the comparison.) RMA and PLM unadjusted p-values were obtained using the Bioconductor [6] affylmGUI package. As these plots indicate, the ANOVA-p approach produces larger p-values in the extremely low p-value region (the region associated with the genes whose concentrations change from 512 pM to 0), but this difference has no meaningful impact: after adjustment for multiple hypothesis testing, all of these genes remain highly significant regardless of the methodology applied. As for the other genes, ANOVA-p and RMA are in reasonably close agreement, and ANOVA-p and PLM are in fair agreement (median ANOVA (1-p) gives somewhat smaller p-values for spiked-in genes with two-fold concentration differences, and PLM has on average somewhat smaller p-values for the non-spiked-in genes). Since the receiver operating characteristic (ROC) curves shown in the following figures provide essentially the same information about the relationships between processing methodologies in a more easily interpretable format, we have not included any other scatterplots comparing p-values.
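
The remark that differences among extremely small p-values do not survive as differences in significance after multiple-testing adjustment can be illustrated with a short sketch. The text does not say which adjustment was used. The Benjamini-Hochberg procedure below is an assumption chosen for illustration, and the p-values, gene counts, and the function name `benjamini_hochberg` are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


# Hypothetical example: 10 strongly changed genes among 20,000 tests.
# One method reports p = 1e-12 for them, another reports p = 1e-8.
n_tests, n_changed = 20_000, 10
background = np.linspace(0.01, 1.0, n_tests - n_changed)

method_a = np.concatenate([np.full(n_changed, 1e-12), background])
method_b = np.concatenate([np.full(n_changed, 1e-8), background])

adj_a = benjamini_hochberg(method_a)[:n_changed]
adj_b = benjamini_hochberg(method_b)[:n_changed]

# The raw p-values differ by four orders of magnitude, yet both sets
# fall far below a 0.05 threshold after adjustment.
print(adj_a.max() < 0.05, adj_b.max() < 0.05)
```

Under these assumed numbers the two methods lead to the same decision for every strongly changed gene, which is the sense in which the gap in the extreme low p-value region has no practical impact.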

