A first principles approach to differential expression in microarray data analysis.

Rubin RA - BMC Bioinformatics (2009)

Bottom Line: Here we take the approach of making the fewest assumptions about the structure of the microarray data. We applied the technique to the HGU-133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results.


Affiliation: Mathematics Department, Whittier College, 13406 E. Philadelphia St., Whittier, CA 90608, USA. brubin698@earthlink.net

ABSTRACT

Background: The disparate results from the methods commonly used to determine differential expression in Affymetrix microarray experiments may well result from the wide variety of probe set and probe level models employed. Here we take the approach of making the fewest assumptions about the structure of the microarray data. Specifically, we only require that, under the hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set: a) the probe amplitudes are independent and identically distributed over the conditions, and b) the distributions of the replicated probe amplitudes are amenable to classical analysis of variance (ANOVA). Log-amplitudes that have been standardized within-chip meet these conditions well enough for our approach, which is to perform ANOVA across conditions for each probe position, and then take the median of the resulting (1 - p) values as a gene-level measure of differential expression.

Results: We applied the technique to the HGU-133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results. This procedure is quite sensitive, so much so that it has revealed the presence of probe sets that might properly be called "unanticipated positives" rather than "false positives", because plots of these probe sets strongly suggest that they are differentially expressed.

Conclusion: The median ANOVA (1-p) approach presented here is a very simple methodology that does not depend on any specific probe level or probe models, and does not require any pre-processing other than within-chip standardization of probe level log amplitudes. Its performance is comparable to other published methods on the standard spike-in data sets, and it has revealed the presence of new categories of probe sets that might properly be referred to as "unanticipated positives" and "unanticipated negatives", which need to be taken into account when using spiked-in data sets as "truthed" test beds.
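The procedure described in the abstract (within-chip standardization, one-way ANOVA per probe position, then the median of the resulting (1 - p) values) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the author's implementation. The function names, the nested-list data layout, and the use of the sample standard deviation for the z-scores are all assumptions made for this example, and the F-distribution tail probability is computed with a textbook continued-fraction routine for the incomplete beta function.

```python
import math
from statistics import mean, median, stdev


def standardize_chip(log_amplitudes):
    """Within-chip z-scores of one chip's probe-level log amplitudes.
    Assumption: the sample standard deviation is used."""
    mu, sd = mean(log_amplitudes), stdev(log_amplitudes)
    return [(v - mu) / sd for v in log_amplitudes]


def _beta_cf(a, b, x, max_iter=300, eps=1e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        for aa in (
            m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0)),
        ):
            d = 1.0 + aa * d
            c = 1.0 + aa / c
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log(1.0 - x)
    )
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def anova_p(groups):
    """One-way ANOVA p-value; groups holds one list of replicates per condition."""
    k, n = len(groups), sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    d1, d2 = k - 1, n - k
    if ss_within == 0.0:
        return 0.0 if ss_between > 0.0 else 1.0
    f = (ss_between / d1) / (ss_within / d2)
    # Upper tail of F(d1, d2), expressed through the incomplete beta function.
    return reg_inc_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))


def median_anova_score(probe_set):
    """Gene-level score: median over probe positions of (1 - p).
    probe_set[i] holds the per-condition replicate lists for probe position i."""
    return median([1.0 - anova_p(groups) for groups in probe_set])


# Toy probe set (made-up numbers): 3 probe positions, 3 conditions, 3 replicates.
probe_set = [
    [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
]
score = median_anova_score(probe_set)
```

In this toy example two of the three probe positions shift across conditions and one does not, so the median (1 - p) is driven by the shifting positions and comes out at about 0.875. Taking the median rather than the mean keeps a single atypical probe position from dominating the gene-level score.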



Figure 16: Plot of a Golden Spike "Unanticipated Negative" with Fold Change = 3. We also find some "unanticipated negatives": genes which should be differentially expressed based on a large initial fold change, but which in practice do not appear to be. In the Golden Spike experiment 153553_at has a log 2 fold change of 3, yet it ranks 7246, 10007, 6034 and 6049 (out of 14010 genes in the comparison) using median ANOVA (1-p), median signed ANOVA (1-p), RMA and PLM measures, respectively. This plot of the within-chip z-scores (control arrays are in red, spiked-in arrays in cyan) and the corresponding p-values for the gene support the conclusion that 153553_at is not differentially expressed and is therefore an unanticipated negative.

Mentions: When it comes to dealing with what would be considered false negatives from the perspective of the spike-in concentrations, the situation is a bit more complicated. First, there are some genes with high spike-in concentrations that should have manifested differential expression in the CEL files, but for some reason did not. For example, 153553_at is a gene in the Golden Spike experiment with a log 2 fold change of 3. Yet it ranks 7246, 10007, 6034, and 6049 (out of 14010) according to the median ANOVA (1-p), signed median ANOVA (1-p), RMA and PLM measures of differential expression, respectively. Its profile in Figure 16 and its p-values are those of a non-expressed gene. In this case we have what might be called an "unanticipated negative". Second, very low concentrations might have been included in the spike-in experimental design in order to assess the lower limits of sensitivity of the various differential expression algorithms, but they also serve to establish the lower limits at which differential expression actually occurs. Those genes for which differential expression does occur at the lowest concentrations do indeed provide the test bed for the sensitivity floor for any procedure. However, for many genes the numerous steps that take place after the preparation of the hybridizing mixture result in the gene not being characterized as differentially expressed in the CEL files. Because they are not actually expressed in the CEL files, these are not really "false negatives". Because it is no surprise that genes with very low concentrations wind up being non-expressed, they really aren't "unexpected negatives" either. For such genes one should not penalize an algorithm for failing to distinguish a difference that existed at the start of the experiment but did not make it through to the final CEL file product. Figure 17 provides an example of each of these types of genes.
The problem is how to tell one type of gene from the other in an efficient manner. Notice that in these cases, for which the probe set profiles for the conditions overlap or cross each other, we needed to use the median of the signed ANOVA (1-p)'s as the measure of differential expression. (When there is no overlap or crossing of profiles for the conditions, it does not matter whether we use the signed or unsigned methodology, except for the sign of the median ANOVA (1-p).)
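To see why crossing profiles call for the signed variant, here is a small hedged sketch in Python. The excerpt does not say how the sign is assigned, so this example assumes a two-condition comparison in which each probe position's (1 - p) takes the sign of the spiked-in minus control mean difference. The function name `median_scores` and the input numbers are hypothetical.

```python
import math
from statistics import median


def median_scores(probes):
    """Return (unsigned, signed) medians of (1 - p) over probe positions.

    probes: list of (p_value, mean_difference) pairs, one per probe position.
    Assumption: the sign is that of the spiked-in minus control mean
    difference (a zero difference is treated as positive).
    """
    unsigned = median([1.0 - p for p, _ in probes])
    signed = median([math.copysign(1.0 - p, diff) for p, diff in probes])
    return unsigned, signed


# Made-up p-values: every probe position differs strongly between conditions.
consistent = [(0.01, 0.8), (0.02, 1.1), (0.01, 0.9), (0.03, 1.2)]
# Same p-values, but the direction flips between probe positions,
# as it would when the two condition profiles cross each other.
crossing = [(0.01, 0.8), (0.02, -1.1), (0.01, 0.9), (0.03, -1.2)]

consistent_scores = median_scores(consistent)
crossing_scores = median_scores(crossing)
```

Under these assumptions the unsigned median is the same for both probe sets, while the signed median stays high only when the probe positions agree in direction. For the crossing profile the opposite signs largely cancel, so the signed score falls close to zero.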

