Considerations when using the significance analysis of microarrays (SAM) algorithm.

Larsson O, Wahlestedt C, Timmons JA - BMC Bioinformatics (2005)

Bottom Line: We have examined the effect of discrete data selection criteria (qualification criteria for inclusion) and response thresholds (output filtering) on the number of significant genes reported by SAM. This effect can be so large that it changes the interpretation of subsequent post hoc analyses, such as ontology overrepresentation analysis. Our results argue for caution when using SAM.


Affiliation: Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius Väg, 35, 171 77 Stockholm, Sweden. ola.larsson@cgb.ki.se

ABSTRACT

Background: Users of microarray technology typically strive to use universally acceptable data analysis strategies to determine significant expression changes in their experiments. One of the most frequently utilised methods for gene expression data analysis is SAM (significance analysis of microarrays). The impact of selection thresholds on the output from SAM may critically alter the conclusion of a study, yet this consideration has not been systematically evaluated in any publication.

Results: We have examined the effect of discrete data selection criteria (qualification criteria for inclusion) and response thresholds (output filtering) on the number of significant genes reported by SAM. Using a reduced data set, obtained either by applying arbitrary restrictions based on abundance calls (e.g. from D-chip) or by applying the fold change (FC) option within SAM (hereafter the FC hurdle), can substantially alter the significant gene list when running SAM in Microsoft Excel. We determined that, for a given final FC criterion (e.g. a 1.5-fold change), the FC hurdle applied within Microsoft Excel SAM alters the number of reported genes above that criterion. The reason is that the FC hurdle changes the composition of the control data set, such that a different significance level (q-value) is obtained for any given gene. This effect can be so large that it changes the interpretation of subsequent post hoc analyses, such as ontology overrepresentation analysis.

Conclusion: Our results argue for caution when using SAM. All data sets analysed with SAM could be reanalysed taking into account the potential impact of the use of arbitrary thresholds to trim data sets before significance testing.
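
The mechanism described in the Results (a gene's q-value depends on which other genes remain in the analysed set) can be illustrated with a small sketch. This is not SAM's permutation procedure: it uses Benjamini-Hochberg adjusted p-values as a stand-in, and the gene names, p-values and fold changes are invented for illustration only.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values for a dict {gene: p-value}.

    Used here only as a simple stand-in for SAM's q-values.
    """
    genes = sorted(pvalues, key=pvalues.get)  # most significant first
    m = len(genes)
    q = {}
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        gene = genes[rank - 1]
        running_min = min(running_min, pvalues[gene] * m / rank)
        q[gene] = running_min
    return q


def q_after_hurdle(genes, fc_hurdle):
    """Drop genes below the FC hurdle, then compute q-values on what is left."""
    kept = {g: p for g, (p, fc) in genes.items() if fc >= fc_hurdle}
    return bh_qvalues(kept)


# Hypothetical genes: name -> (p-value, fold change). Not data from the study.
genes = {
    "g1": (0.001, 2.0),
    "g2": (0.025, 1.6),
    "g3": (0.04, 1.2),
    "g4": (0.5, 1.05),
    "g5": (0.8, 1.02),
}

q_full = q_after_hurdle(genes, fc_hurdle=1.0)     # no trimming
q_trimmed = q_after_hurdle(genes, fc_hurdle=1.1)  # g4 and g5 removed

print(f"g2 q-value, full set:    {q_full['g2']:.4f}")
print(f"g2 q-value, trimmed set: {q_trimmed['g2']:.4f}")
```

In this toy case the gene g2 has the same p-value and the same fold change (1.6, above a final 1.5 criterion) in both runs, yet its q-value is about 0.0625 on the full set and about 0.0375 after trimming, so it would be reported at q < 0.05 only in the trimmed analysis.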


Figure 1: FC effects on the endurance training data set. SAM analysis of the exercise data set was run at various fold change settings, scoring genes with a q-value < 0.05 and FC > 1.5. This was done to assess the effect of the fold change option in the SAM Excel add-in on the genes reported as significant at a higher fold change. The figure shows the number of scored genes for four chip and sample combinations: (A) All eight subjects before and after training (16 arrays in total), compared in a paired analysis using the U95A chips. (B) All eight subjects before and after training (16 arrays in total), compared in a paired analysis using the U95B chips. (C) The reduced group of four low responders (8 arrays in total), compared in a paired analysis using the U95B chips. (D) The reduced group of four low responders (8 arrays in total), compared in a paired analysis using the U95D chips.

Mentions: We used three different biological data sets to assess how widespread any effects were. The first was a paired data set from a human skeletal muscle study that examined subjects before and after endurance training using the U95A-E platform (Affymetrix) [3]. The data were RMA normalized [4-6], and SAM was performed using different biological sample subgroups (formed on the basis of a variety of physiological parameters) and chipset identities (A, B or D). Changing the FC setting in SAM (Excel) altered the final list of significant genes at the 1.5 FC criterion (Figure 1A–D). Importantly, the effect was not uniform across all conditions. For some samples, a sequential increase in the FC hurdle in SAM correlated with an increased number of reported genes (Figure 1A–B), while under other conditions a very small change in FC had an apparently random impact on the composition of the significant gene list (e.g. Figure 1C–D). The number of genes that passed each fold change criterion can be found in [Additional file 1].
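
The kind of sweep summarised in Figure 1 can be sketched as follows. This is a simplified, SAM-like procedure under stated assumptions, not the Excel add-in's implementation: the data are synthetic log2 ratios, the statistic is a mean difference over a standard error plus a fixed fudge constant `s0`, the null comes from random sign flips of the paired differences, and the FC hurdle is assumed to act by dropping genes before the FDR estimate. All function names and parameter values are hypothetical.

```python
import bisect
import math
import random


def simulate_log2_changes(n_genes=300, n_subjects=8, n_changed=30, seed=0):
    """Synthetic per-gene, per-subject log2(after/before) values (made-up data)."""
    rng = random.Random(seed)
    data = []
    for g in range(n_genes):
        shift = 1.0 if g < n_changed else 0.0  # true log2 change
        data.append([rng.gauss(shift, 0.5) for _ in range(n_subjects)])
    return data


def mean(values):
    return sum(values) / len(values)


def fold_change(diffs):
    """Absolute fold change implied by the mean log2 difference."""
    return 2 ** abs(mean(diffs))


def d_stat(diffs, s0=0.1):
    """SAM-like statistic: mean difference / (standard error + s0)."""
    n = len(diffs)
    m = mean(diffs)
    var = sum((x - m) ** 2 for x in diffs) / (n - 1)
    return m / (math.sqrt(var / n) + s0)


def count_significant(data, fc_hurdle, final_fc=1.5, q_cut=0.05,
                      n_perm=50, seed=1):
    """Count genes with estimated q < q_cut and FC > final_fc after the hurdle."""
    rng = random.Random(seed)
    # Assumed hurdle behaviour: genes below fc_hurdle are dropped first.
    kept = [d for d in data if fold_change(d) >= fc_hurdle]
    if not kept:
        return 0
    observed = [abs(d_stat(d)) for d in kept]
    # Null statistics: flip the sign of whole subjects, as in a paired design.
    null = []
    for _ in range(n_perm):
        signs = [rng.choice((-1, 1)) for _ in kept[0]]
        null.extend(abs(d_stat([s * x for s, x in zip(signs, d)])) for d in kept)
    null.sort()
    # q-value: smallest estimated FDR over all thresholds that still call the gene.
    order = sorted(range(len(kept)), key=lambda i: observed[i], reverse=True)
    q = [1.0] * len(kept)
    running_min = 1.0
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        n_null = len(null) - bisect.bisect_left(null, observed[i])
        running_min = min(running_min, (n_null / n_perm) / rank)
        q[i] = running_min
    return sum(1 for i, d in enumerate(kept)
               if q[i] < q_cut and fold_change(d) > final_fc)


data = simulate_log2_changes()
hurdles = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
counts = {h: count_significant(data, h) for h in hurdles}
for h in hurdles:
    print(f"FC hurdle {h:.1f}: {counts[h]} genes with q < 0.05 and FC > 1.5")
```

Because the final criterion (q < 0.05 and FC > 1.5) is held fixed, any difference between the printed counts comes only from the hurdle changing which genes contribute to the null distribution and to the FDR estimate. The counts from this synthetic example say nothing about the magnitudes reported for the real data sets.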

