Use of normalization methods for analysis of microarrays containing a high degree of gene effects.

Ni TT, Lemon WJ, Shyr Y, Zhong TP - BMC Bioinformatics (2008)

Bottom Line: We have demonstrated that the new method provides considerable improvement in the accuracy of data normalization when large proportions of gene effects are present. Adding the variable selection component of the new method to alternative normalization approaches corrects most of the sensitivity of these methods to gene effects. The results indicate that our method may be used without prior knowledge of or assumptions about housekeeping genes to normalize microarrays that are quite different.


Affiliation: Division of Cardiovascular Medicine, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN 37232, USA. terri.ni@vanderbilt.edu

ABSTRACT

Background: High-throughput microarrays are widely used to study gene expression across tissues and developmental stages. Analysis of gene expression data is challenging in these experiments due to the presence of significant percentages of differentially expressed genes (DEG) observed between tissues and developmental stages. Data normalization methods that are widely used today are not designed for data with a large proportion of tissue or gene effects.

Results: In our current study, we describe a novel two-dimensional nonparametric normalization method for analyzing microarray data that functions well in the absence or presence of large numbers of gene effects. Rather than relying on an assumption of low variability among most genes, the method implements a unique peak selection strategy to distinguish DEG from genes that are invariant in expression, prior to nonlinear curve fitting. We compared the method under simulated and experimental conditions with five alternative nonlinear normalization approaches: quantile, lowess, robust lowess, invariant set, and cross-correlation (Xcorr). The simulations included various percentages of simulated DEG, and the experimental data come from publicly available datasets known to be difficult to analyze because approximately 34% of the genes are DEG.
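To make concrete why the comparison methods can be sensitive to gene effects, here is a minimal sketch of quantile normalization, one of the five alternatives named above. It is a generic textbook version, not the authors' code and not their new method; the function name `quantile_normalize` and the toy intensities are assumptions made for illustration, and ties are simply broken by sort order.

```python
def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists, one list of intensities per array,
    with position g referring to the same gene in every list.
    """
    n_genes = len(arrays[0])
    if any(len(a) != n_genes for a in arrays):
        raise ValueError("all arrays must cover the same genes")

    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_genes)
    ]

    normalized = []
    for a in arrays:
        # Gene indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_genes), key=lambda g: a[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = rank_means[rank]  # replace value by the rank's mean
        normalized.append(out)
    return normalized


# Toy example with two arrays and three genes (hypothetical values).
toy = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

Because every array is mapped onto one shared distribution using all genes, a large fraction of DEG shifts the reference distribution itself, which is consistent with the sensitivity to gene effects described in the abstract.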

Conclusion: We have demonstrated that the new method provides considerable improvement in the accuracy of data normalization when large proportions of gene effects are present. The performance improvement is mostly attributed to its variable selection component, which is designed to separate expression-invariant genes from DEG. Adding this key component of the new method to alternative normalization approaches corrects most of the sensitivity of these methods to gene effects. The results indicate that our method may be used without prior knowledge of or assumptions about housekeeping genes to normalize microarrays that are quite different.



Figure 6: Kolmogorov-Smirnov tests for agreement of normalization in the analysis of Golden Spike data. Statistical difference was assessed (Kolmogorov-Smirnov tests) between normalized intensities computed from all genes and normalized intensities computed from non-DEG with the same normalization method. All normalized intensities were calculated relative to a reference array, and each of the six arrays was chosen in turn as the reference. A total of 12 replicate array pairs and 18 non-replicate array pairs could be formed. Blue, green, cyan, magenta, black, and red mark the reference arrays C1, C2, C3, S1, S2, and S3, respectively. The difference between the two normalization curves is statistically significant when the P value is < 0.01.
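The pair counts in the caption can be checked with a short sketch. It assumes, as one reading of the caption, that a pair is ordered (reference array, target array), so that each of the six arrays is paired with the other five; the variable names are hypothetical.

```python
from itertools import permutations

arrays = ["C1", "C2", "C3", "S1", "S2", "S3"]


def group(name):
    """Condition label of an array: 'C' or 'S'."""
    return name[0]


# Ordered (reference, target) pairs: each array serves once as the reference
# for each of the other five arrays.
pairs = list(permutations(arrays, 2))
replicate = [(r, t) for r, t in pairs if group(r) == group(t)]
non_replicate = [(r, t) for r, t in pairs if group(r) != group(t)]
```

Under this reading each reference array has 2 replicate partners and 3 non-replicate partners, giving 6 x 2 = 12 and 6 x 3 = 18 pairs.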

Mentions: Figure 5 illustrates typical agreement of normalization curves computed from all genes (red) and from non-DEG (black) when C and S chips (non-replicate; Fig. 5A) or C and C chips (replicate; Fig. 5B) are paired. With the replicate array pair, the red and black lines overlap over the entire intensity range for all methods, indicating that normalized intensities from all genes agree with those computed from non-DEG under 0% gene effects (Fig. 5B). In the non-replicate array pair, however, all methods except NVSA and cross-correlation display clear divergence between the black and red normalization curves, and the divergence appears to grow with increasing amounts of DEG (Fig. 5A). We then employed the two-sample Kolmogorov-Smirnov test to test the hypothesis that the two sets of normalized intensities (or two normalization curves) were drawn from the same population (Fig. 6). At the 1% significance level (rejection when p < 0.01), the hypothesis is not rejected for any NVSA analysis, whether the array pairs are replicate or non-replicate, indicating that the normalized intensities based on all genes by NVSA are statistically indistinguishable from those based on known invariant genes under all tested conditions. In contrast, the quantile, lowess, invariant set, and robust lowess methods display statistically significant divergence between the two populations of normalized intensities on all non-replicate array pairs (Fig. 6), yet no significant divergence on replicate pairs. In data analyzed by the cross-correlation method, 11 out of 15 non-replicate array pairs display statistically significant divergence. This agrees with our previous observation with simulation data that the method generates substantial error rates when gene effects are greater than 30%.
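The test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `ks_two_sample`, the simulated intensities, and the asymptotic p-value approximation (Kolmogorov series with a common small-sample correction) are all choices made for the example.

```python
import math
import random


def ks_two_sample(x, y):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical CDFs. The p-value is an
    asymptotic approximation, which is an assumption of this sketch.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every value equal to v (handles ties).
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))

    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        return d, 1.0  # the series converges slowly here and p is close to 1
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Hypothetical normalized log-intensities for one array pair.
rng = random.Random(0)
from_non_deg = [rng.gauss(8.0, 1.0) for _ in range(500)]
from_all_genes_biased = [v + 1.0 for v in from_non_deg]  # a shifted curve

ALPHA = 0.01  # 1% significance level, as in the text
d_same, p_same = ks_two_sample(from_non_deg, from_non_deg)
d_bias, p_bias = ks_two_sample(from_non_deg, from_all_genes_biased)
diverges = p_bias < ALPHA  # True means the two normalizations disagree
```

A p-value at or above the threshold means only that the hypothesis of a common population is not rejected; it does not prove the two sets of intensities are identical.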

