Considerations when using the significance analysis of microarrays (SAM) algorithm.

Larsson O, Wahlestedt C, Timmons JA - BMC Bioinformatics (2005)

Bottom Line: We have examined the effect of discrete data selection criteria (qualification criteria for inclusion) and response thresholds (output filtering) on the number of significant genes reported by SAM. This effect can be so large that it changes the interpretation of subsequent post hoc analyses, such as ontology overrepresentation analysis. Our results argue for caution when using SAM.


Affiliation: Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius Väg, 35, 171 77 Stockholm, Sweden. ola.larsson@cgb.ki.se

ABSTRACT

Background: Users of microarray technology typically strive to use universally acceptable data analysis strategies to determine significant expression changes in their experiments. One of the most frequently utilised methods for gene expression data analysis is SAM (significance analysis of microarrays). The impact of selection thresholds on the output from SAM may critically alter the conclusions of a study, yet this consideration has not been systematically evaluated in any publication.

Results: We have examined the effect of discrete data selection criteria (qualification criteria for inclusion) and response thresholds (output filtering) on the number of significant genes reported by SAM. Using a reduced data set, obtained either by applying arbitrary restrictions on abundance calls (e.g. from D-chip) or by applying the fold change (FC) option within SAM (named the FC hurdle hereafter), can substantially alter the significant gene list when running SAM in Microsoft Excel. We determined that, for a given final FC criterion (e.g. 1.5 fold change), the FC hurdle applied within Microsoft Excel SAM alters the number of reported genes above the final FC criterion. The reason is that the FC hurdle changes the composition of the control data set, such that a different significance level (q-value) is obtained for any given gene. This effect can be so large that it changes the interpretation of subsequent post hoc analyses, such as ontology overrepresentation analysis.
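The dependence of a gene's q-value on the set of genes tested alongside it can be illustrated with a minimal sketch. Note the assumptions: SAM estimates q-values by permutation, whereas the code below substitutes the simpler Benjamini-Hochberg adjustment as a stand-in, and the gene names, fold changes and p-values are invented for illustration. The function names `bh_qvalues` and `qvalues_with_fc_hurdle` are hypothetical and are not part of SAM.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted values for a list of p-values.

    Used here only as a stand-in for SAM's permutation-based q-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the result is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def qvalues_with_fc_hurdle(genes, fc_hurdle):
    """Drop genes below the FC hurdle, then compute q-values on the rest.

    genes: dict mapping gene name -> (fold change magnitude >= 1, p-value).
    """
    kept = [(name, p) for name, (fc, p) in genes.items() if fc >= fc_hurdle]
    qs = bh_qvalues([p for _, p in kept])
    return {name: q for (name, _), q in zip(kept, qs)}


# Invented toy data: (fold change, p-value) per gene.
genes = {
    "g1": (2.0, 0.001),
    "g2": (1.8, 0.010),
    "g3": (1.6, 0.020),
    "g4": (1.2, 0.300),
    "g5": (1.1, 0.500),
    "g6": (1.0, 0.900),
}

q_no_hurdle = qvalues_with_fc_hurdle(genes, 1.0)  # all six genes tested
q_hurdle = qvalues_with_fc_hurdle(genes, 1.5)     # only g1, g2, g3 tested

for name in q_hurdle:
    print(name, round(q_no_hurdle[name], 4), round(q_hurdle[name], 4))
```

In this toy case the same three genes receive smaller adjusted values once the low fold change genes are removed before testing, which is the direction of the effect described for the FC hurdle. The magnitude in a real SAM run depends on the permutation distribution and is not reproduced by this sketch.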

Conclusion: Our results argue for caution when using SAM. All data sets analysed with SAM could be reanalysed taking into account the potential impact of the use of arbitrary thresholds to trim data sets before significance testing.



Figure 4: FC affects individual q-values. (A) q-values, at other FC settings, of all genes scored as significant in the brain aging study (reduced data set) at FC 1.51. Genes acquire only discrete q-values, so all 538 genes are shown but many overlap. (B) Running SAM with different FC settings changes the biological interpretation: Venn diagram comparing the number of significantly overrepresented classifications (EASE score <0.05) using the reduced brain aging data set analysed either with a 1.0 FC setting (314 genes) or a 1.5 FC setting (538 genes).

Mentions: The analysis presented above demonstrates that the q-value obtained for a specific gene depends on the FC hurdle applied during SAM in Excel. To monitor the q-values generated for individual genes, we obtained the q-values from all SAM calculations using various FC hurdles for all genes that were reported as significant at the 1.51 FC setting (538 genes) in the brain aging study (Figure 4A). The highest q-value for a subset of the genes that passed the final fold change criterion was 3.6% when SAM was performed with a 1.0 FC hurdle. The same genes appear as significant when the "optimal" Excel SAM FC hurdle setting of 1.51 is used (optimal in the sense that this setting produced the largest significant gene list), while the highest q-value reported was then only 0.97%.
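The monitoring step described above, following a fixed reference gene list across several FC hurdle settings and noting the highest q-value at each, might be sketched as follows. This assumes the per-hurdle q-values have already been exported from SAM into a dictionary. The helper name `max_q_per_hurdle`, the gene names and all numbers are hypothetical and are not taken from the study.

```python
def max_q_per_hurdle(q_by_hurdle, reference_genes):
    """Return the highest q-value among the reference genes at each FC hurdle.

    q_by_hurdle: dict mapping FC hurdle -> dict of gene name -> q-value (%).
    Genes absent at a hurdle (filtered out before testing) are skipped.
    """
    summary = {}
    for hurdle, qvals in sorted(q_by_hurdle.items()):
        present = [qvals[g] for g in reference_genes if g in qvals]
        summary[hurdle] = max(present) if present else None
    return summary


# Invented q-values in percent, for illustration only.
q_by_hurdle = {
    1.0: {"geneA": 0.5, "geneB": 2.1, "geneC": 3.4, "geneD": 12.0},
    1.25: {"geneA": 0.3, "geneB": 1.4, "geneC": 2.0},
    1.5: {"geneA": 0.2, "geneB": 0.6, "geneC": 0.9},
}

# Reference list: genes called significant at the highest hurdle setting.
reference = ["geneA", "geneB", "geneC"]

summary = max_q_per_hurdle(q_by_hurdle, reference)
for hurdle, worst_q in summary.items():
    print(f"FC hurdle {hurdle}: highest q-value {worst_q}%")
```

Tabulating the worst-case q-value per setting makes it easy to see whether a gene list that passes a chosen q-value cutoff at one hurdle would also pass it at another.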

