Extracting gene expression patterns and identifying co-expressed genes from microarray data reveals biologically responsive processes.

Chou JW, Zhou T, Kaufmann WK, Paules RS, Bushel PR - BMC Bioinformatics (2007)

Bottom Line: However, the variation associated with microarray data and the complexity of the experimental designs make the acquisition of co-expressed genes a challenge. Among them were mitotic cell cycle, DNA replication, DNA repair, cell cycle checkpoint, and G0-like status transition. EPIG can be applied to data sets from a variety of experimental designs.


Affiliation: Microarray Group, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA. chou@niehs.nih.gov

ABSTRACT

Background: A common observation in the analysis of gene expression data is that many genes display similarity in their expression patterns and therefore appear to be co-regulated. However, the variation associated with microarray data and the complexity of the experimental designs make the acquisition of co-expressed genes a challenge. We developed a novel method for Extracting microarray gene expression Patterns and Identifying co-expressed Genes, designated as EPIG. The approach utilizes the underlying structure of gene expression data to extract patterns and identify co-expressed genes that are responsive to experimental conditions.

Results: Through evaluation of the correlations among profiles, the magnitude of variation in gene expression profiles, and profile signal-to-noise ratios, EPIG extracts a set of patterns representing co-expressed genes. The method is shown to work well with a simulated data set and with microarray data obtained from time-series studies of dauer recovery and L1 starvation in C. elegans and of ultraviolet (UV) or ionizing radiation (IR)-induced DNA damage in diploid human fibroblasts. With the simulated data set, EPIG extracted the appropriate number of patterns, which were more stable and homogeneous than the patterns determined using the CLICK or CAST clustering algorithms. However, CLICK performed better than EPIG and CAST with respect to the average correlation between clusters/patterns of the simulated data. With real biological data, EPIG extracted more dauer-specific patterns than CLICK. Furthermore, analysis of the IR/UV data revealed 18 unique patterns, and 2661 genes out of approximately 17,000 were identified as significantly expressed and categorized to the patterns by EPIG. The time-dependent patterns displayed both similar and dissimilar responses between IR and UV treatments. Gene Ontology analysis applied to each pattern-related subset of co-expressed genes revealed underlying biological processes affected by IR- and/or UV-induced DNA damage.

Conclusion: EPIG competed with CLICK and performed better than CAST in extracting patterns from simulated data. EPIG extracted more biologically informative patterns and co-expressed genes from both C. elegans and IR/UV-treated human fibroblasts. Gene Ontology analysis of the genes in the patterns extracted by EPIG revealed several key biological categories related to p53-dependent cell cycle control in the IR/UV data. Among them were mitotic cell cycle, DNA replication, DNA repair, cell cycle checkpoint, and G0-like status transition. EPIG can be applied to data sets from a variety of experimental designs.


Figure 6: Optimization of the Mt value. Cluster size threshold Mt (the horizontal axis) versus average of patterns' SNR (A) and number of extracted patterns (B).

Mentions: There are two main thresholds used in EPIG pattern extraction: the local cluster size threshold Mt and the correlation threshold Rt. Rt determines how similar the extracted patterns are allowed to be. Depending on the sample size, one may choose Rt such that even the most similar patterns show clear differences in response. For example, in Figure 4, Patterns 5 and 6 have a correlation r-value of 0.77, yet the two contain genes whose expression patterns differ clearly in the response to UV-induced DNA damage: in Pattern 5, gene expression was repressed only at 2 h post-UV, while in Pattern 6 it was repressed at both 2 and 6 h post-UV. Mt is the minimum number of genes a local cluster must contain for a profile candidate to be deemed a pattern. The value of Mt affects the pattern extraction outcome. Higher Mt values may conceal a meaningful pattern that has fewer co-expressed genes. On the other hand, lower Mt values may lead to the extraction of some patterns that lack biological meaning. To test for an optimal Mt setting, we varied its value from 2 to 19 and performed EPIG analysis on the IR-treated gene expression data. Figure 6 shows that the average pattern SNR increased as Mt increased, while the number of extracted patterns decreased. This result seems plausible: as Mt increases, more local correlation clusters are filtered out because their sizes are less than Mt, and the fewer extracted patterns that remain have higher average SNRs. However, when Mt ≥ 6, the SNR had an up-shift and the number of extracted patterns had a down-shift. This result prompted us to set Mt to 6 for the given data set. To be precise, one should vary these thresholds empirically for a given data set and examine the outcomes. We have done just that and have concluded, across many gene expression data sets analyzed with EPIG, that setting Mt to 6 and Rt to 0.8 has worked reasonably well (data not shown). Incidentally, some genes may have profiles that are not similar to those of any other gene, or their local cluster may be smaller than Mt. These "orphan" genes (singletons) will not be considered pattern candidates, nor will they be categorized to any extracted pattern. Attention certainly needs to be paid to these orphan genes, as part of the EPIG analysis result, to determine whether they have a unique role in the treatment response.
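
The roles of Mt and Rt described above can be illustrated with a short Python sketch. It is not the EPIG implementation: the greedy selection rule, the choice to count the seed gene in its own local cluster, the function names `extract_patterns` and `sweep_mt`, and the toy profiles are all assumptions made for illustration. The pattern SNR plotted in Figure 6A is omitted because its definition is not given in this excerpt.

```python
import numpy as np

def extract_patterns(profiles, mt=6, rt=0.8):
    """Return row indices of profiles kept as patterns (simplified sketch).

    Assumption: the local cluster of a gene is every gene (itself included)
    whose Pearson correlation with it is at least rt.
    """
    corr = np.corrcoef(profiles)              # gene-by-gene Pearson r
    sizes = (corr >= rt).sum(axis=1)          # local cluster size per gene
    order = np.argsort(-sizes, kind="stable") # largest clusters first
    patterns = []
    for i in order:
        if sizes[i] < mt:
            continue                          # cluster too small: filtered by Mt
        # Keep a candidate only if it is less similar than rt to kept patterns.
        if all(corr[i, j] < rt for j in patterns):
            patterns.append(int(i))
    return patterns

def sweep_mt(profiles, mt_values=range(2, 20), rt=0.8):
    """Number of extracted patterns for each Mt (cf. Figure 6B)."""
    return {mt: len(extract_patterns(profiles, mt, rt)) for mt in mt_values}

# Toy data (hypothetical): 8 genes following shape A, 3 following shape B,
# and 1 "orphan" gene with shape C that correlates with neither.
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
B = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
C = np.array([1.0, 1.0, -2.0, -2.0, 1.0, 1.0])
profiles = np.vstack(
    [A * s for s in range(1, 9)] + [B * s for s in range(1, 4)] + [C]
)

for mt, n_patterns in sweep_mt(profiles).items():
    print(f"Mt={mt:2d}  patterns={n_patterns}")
```

On this toy input the small B cluster disappears once Mt exceeds its size, and the orphan gene never forms a pattern at any Mt in the sweep, which mirrors the trade-off discussed above: raising Mt removes small clusters whether or not they are meaningful.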

