A first principles approach to differential expression in microarray data analysis.

Rubin RA - BMC Bioinformatics (2009)

Bottom Line: Here we take the approach of making the fewest assumptions about the structure of the microarray data. We applied the technique to the HG-U133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results.


Affiliation: Mathematics Department, Whittier College, 13406 E. Philadelphia St., Whittier, CA 90608, USA. brubin698@earthlink.net

ABSTRACT

Background: The disparate results from the methods commonly used to determine differential expression in Affymetrix microarray experiments may well result from the wide variety of probe set and probe level models employed. Here we take the approach of making the fewest assumptions about the structure of the microarray data. Specifically, we only require that, under the hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set: a) the probe amplitudes are independent and identically distributed over the conditions, and b) the distributions of the replicated probe amplitudes are amenable to classical analysis of variance (ANOVA). Log-amplitudes that have been standardized within-chip meet these conditions well enough for our approach, which is to perform ANOVA across conditions for each probe position, and then take the median of the resulting (1 - p) values as a gene-level measure of differential expression.
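The procedure described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names, the nested-list data layout, and the example values are assumptions made here, and the F-distribution tail probability is computed with a standard continued-fraction evaluation of the regularized incomplete beta function so that the block needs only the Python standard library.

```python
from math import exp, lgamma, log, log1p
from statistics import mean, median


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even and odd terms of the continued fraction for this m.
        for num in (
            m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)),
        ):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def anova_p_value(groups):
    """Classical one-way ANOVA p-value; groups[c] holds the replicates of condition c."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df1, df2 = k - 1, n - k
    if ss_within == 0.0:
        # Degenerate case: no within-condition variation at all.
        return 0.0 if ss_between > 0.0 else 1.0
    f_stat = (ss_between / df1) / (ss_within / df2)
    # Upper tail of the F distribution with (df1, df2) degrees of freedom.
    return _reg_inc_beta(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f_stat))


def median_anova_score(probe_set):
    """probe_set[j][c] = replicate values of probe position j under condition c."""
    return median(1.0 - anova_p_value(position) for position in probe_set)


# Made-up standardized log-amplitudes: 3 probe positions, 2 conditions, 3 replicates.
probe_set = [
    [[-0.20, -0.10, -0.30], [0.90, 1.10, 1.00]],
    [[0.10, 0.00, 0.20], [1.20, 1.00, 1.10]],
    [[-0.50, -0.40, -0.60], [-0.45, -0.55, -0.50]],
]
print(median_anova_score(probe_set))
```

Taking the median rather than the mean means that a minority of poorly behaved probe positions, such as the third one in the made-up example, cannot dominate the gene-level score.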

Results: We applied the technique to the HG-U133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results. The procedure is sensitive enough to have revealed probe sets that might properly be called "unanticipated positives" rather than "false positives", because plots of these probe sets strongly suggest that they are differentially expressed.
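For readers unfamiliar with how an ROC curve is built from gene-level scores, the following sketch may help. It is a generic construction, not code from the paper: `roc_points`, `auc`, and the example scores and truth labels are hypothetical, and it assumes at least one spiked and one non-spiked probe set.

```python
def roc_points(scores, is_spiked):
    """Return (false positive rate, true positive rate) pairs, highest scores first."""
    n_pos = sum(is_spiked)
    n_neg = len(is_spiked) - n_pos
    ranked = sorted(zip(scores, is_spiked), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, truth) in enumerate(ranked):
        if truth:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last member of a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up gene-level scores and spike-in truth labels.
scores = [0.9, 0.8, 0.7, 0.6]
is_spiked = [True, False, True, False]
print(roc_points(scores, is_spiked))
print(auc(roc_points(scores, is_spiked)))
```

Because the curve depends entirely on the truth vector, relabeling a high-scoring probe set from "not spiked" to "differentially expressed" moves a false positive into the true positive count, which is why the status of "unanticipated positives" matters when methods are compared.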

Conclusion: The median ANOVA (1-p) approach presented here is a very simple methodology that does not depend on any specific probe level or probe models, and does not require any pre-processing other than within-chip standardization of probe level log amplitudes. Its performance is comparable to that of other published methods on the standard spike-in data sets, and it has revealed new categories of probe sets that might properly be referred to as "unanticipated positives" and "unanticipated negatives". These need to be taken into account when using spiked-in data sets as "truthed" test beds.
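The abstract does not spell out the standardization, so the sketch below makes an explicit assumption: each chip's probe intensities are log2-transformed and then converted to z-scores using that chip's own mean and (population) standard deviation. The function name and the example intensities are hypothetical.

```python
from math import log2
from statistics import mean, pstdev


def standardize_chip(intensities):
    """Assumed scheme: log2-transform, then z-score within a single chip.

    Requires strictly positive intensities that are not all equal.
    """
    logs = [log2(v) for v in intensities]
    center, spread = mean(logs), pstdev(logs)
    return [(v - center) / spread for v in logs]


# Two made-up chips whose raw intensities differ only by a factor of two.
chip_a = standardize_chip([100.0, 200.0, 400.0, 800.0])
chip_b = standardize_chip([200.0, 400.0, 800.0, 1600.0])
print(chip_a)
print(chip_b)
```

Under this assumption a chip-wide multiplicative difference in brightness becomes an additive shift on the log scale and is removed by centering, so the two example chips yield the same standardized values.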



Figure 18: Plot of gene with a possibly ambiguous differential expression status. In this paper we point out the need to establish truth at the probe set/CEL file level as well as at the spike-in level. Achieving that goal will require community-wide consensus as to what constitutes differential expression, based on the contents of the probe sets involved in the comparison. For some genes consensus might not be an easy thing to achieve, as this plot of 1552_i_at from the HG-U95A experiment suggests. HG-U95A experimental condition Q is plotted in red, and condition A in cyan. Even though all summary measures considered in this paper declare this gene to be highly differentially expressed, it would not be at all surprising to find many opinions as to whether this gene is really expressed or not. Just as "Marginal" is an acceptable condition when making MAS 5 calls, so "Ambiguous" may have to be an acceptable state for some comparisons.

Mentions: Determining truth at the CEL file level for all pairs of conditions will take a lot of work, but that is a requirement for a test bed that can be trusted to assess the effectiveness of the various differential expression paradigms. However, in practice there will probably be "only" several hundred comparative gene conditions that will require close examination of probe set profiles, and initially it makes sense to focus on the d = 1 conditions. It may well be that for some genes there will not be community-wide agreement as to whether they are differentially expressed or not for some pairs of conditions. Figure 18 presents a possible example of such a gene from the Q versus A conditions of the HG-U95A Latin Square design. Although the adjusted p-values for median signed ANOVA (1-p), RMA and PLM (8.7 × 10^-18, 6.9 × 10^-12, and 4.6 × 10^-16, respectively) all strongly indicate differential expression, a biologist looking at the profile plots might well have second thoughts (either about differential expression or about the validity of the probe set itself). If researchers cannot agree on whether a collection of probe sets is differentially expressed, we cannot expect mechanized procedures to agree either. In such cases, there may need to be a label in the CEL file truth metadata indicating the ambiguous status of the condition.
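One way such a label could be represented is sketched below. Everything in it is hypothetical: the paper does not define a metadata format, so the `TruthStatus` values, the `TruthRecord` fields, the second probe set identifier, and the choice to leave ambiguous comparisons out of scoring are all assumptions made for illustration. Marking 1552_i_at as ambiguous simply mirrors the "possible example" in Figure 18.

```python
from dataclasses import dataclass
from enum import Enum


class TruthStatus(Enum):
    DIFFERENTIAL = "differential"
    NOT_DIFFERENTIAL = "not differential"
    AMBIGUOUS = "ambiguous"  # no community consensus for this comparison


@dataclass(frozen=True)
class TruthRecord:
    probe_set: str
    condition_a: str
    condition_b: str
    status: TruthStatus


def scorable(records):
    """Keep only the comparisons whose truth status is unambiguous."""
    return [r for r in records if r.status is not TruthStatus.AMBIGUOUS]


records = [
    TruthRecord("1552_i_at", "Q", "A", TruthStatus.AMBIGUOUS),
    # Hypothetical identifier, used here only to have an unambiguous entry.
    TruthRecord("hypothetical_probe_set_1", "Q", "A", TruthStatus.DIFFERENTIAL),
]
print([r.probe_set for r in scorable(records)])
```

Keeping the status per probe set and per pair of conditions, rather than per gene, matches the point made above that truth has to be established at the probe set/CEL file level.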

