Extracting gene expression patterns and identifying co-expressed genes from microarray data reveals biologically responsive processes.

Chou JW, Zhou T, Kaufmann WK, Paules RS, Bushel PR - BMC Bioinformatics (2007)

Bottom Line: However, the variation associated with microarray data and the complexity of the experimental designs make the acquisition of co-expressed genes a challenge. Among them were mitotic cell cycle, DNA replication, DNA repair, cell cycle checkpoint, and G0-like status transition. EPIG can be applied to data sets from a variety of experimental designs.


Affiliation: Microarray Group, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA. chou@niehs.nih.gov

ABSTRACT

Background: A common observation in the analysis of gene expression data is that many genes display similarity in their expression patterns and therefore appear to be co-regulated. However, the variation associated with microarray data and the complexity of the experimental designs make the acquisition of co-expressed genes a challenge. We developed a novel method for Extracting microarray gene expression Patterns and Identifying co-expressed Genes, designated as EPIG. The approach utilizes the underlying structure of gene expression data to extract patterns and identify co-expressed genes that are responsive to experimental conditions.

Results: Through evaluation of the correlations among profiles, the magnitude of variation in gene expression profiles, and profile signal-to-noise ratios, EPIG extracts a set of patterns representing co-expressed genes. The method is shown to work well with a simulated data set and with microarray data obtained from time-series studies of dauer recovery and L1 starvation in C. elegans and of ultraviolet (UV) or ionizing radiation (IR)-induced DNA damage in diploid human fibroblasts. With the simulated data set, EPIG extracted the appropriate number of patterns, which were more stable and homogeneous than the patterns determined using the CLICK or CAST clustering algorithms. However, CLICK performed better than EPIG and CAST with respect to the average correlation between clusters/patterns of the simulated data. With real biological data, EPIG extracted more dauer-specific patterns than CLICK. Furthermore, analysis of the IR/UV data revealed 18 unique patterns, and 2661 genes out of approximately 17,000 were identified as significantly expressed and categorized to the patterns by EPIG. The time-dependent patterns displayed both similar and dissimilar responses between IR and UV treatments. Gene Ontology analysis applied to each pattern-related subset of co-expressed genes revealed underlying biological processes affected by IR- and/or UV-induced DNA damage.

Conclusion: EPIG competed with CLICK and performed better than CAST in extracting patterns from simulated data. EPIG extracted more biologically informative patterns and co-expressed genes from both C. elegans and IR/UV-treated human fibroblasts. Gene Ontology analysis of the genes in the patterns extracted by EPIG revealed several key biological categories related to p53-dependent cell cycle control in the IR/UV data. Among them were mitotic cell cycle, DNA replication, DNA repair, cell cycle checkpoint, and G0-like status transition. EPIG can be applied to data sets from a variety of experimental designs.

Figure 2: Plot of the first three components of a PCA using 90 simulated profiles. The six clusters, A to F, labelled in different colors, correspond to distributions A to F in Table 1. Each cluster consists of 15 generated profiles. The first 3 principal components (PCs) captured 84.3% of the variability in the data. The x-axis is PC1, the y-axis PC2 and the z-axis PC3.

Mentions: Table 1 lists the mean value distributions used to simulate the data; the standard deviation was held constant at 0.4. Figure 1 displays the mean values of the six probability distributions used to generate the data: Figure 1(A) and 1(B) are monotonically increasing and decreasing, respectively; Figure 1(C) and 1(D) start and end at zero but peak at the third and second data points, respectively; and Figure 1(E) and 1(F) are flat at 3 and 0, respectively. The zero mean values in Figure 1(F) reflect a flat response, analogous to real gene expression data in which a number of genes may not be responsive to a given series of treatments. Normal deviates were drawn at random to generate 15 profiles for each of the six distributions. Figure 2 shows a principal component analysis (PCA) of the simulated data, in which the six distinct clusters of profiles are distributed in 3-dimensional space. When the simulated data were analyzed by EPIG and CLICK for comparison, EPIG extracted 5 patterns (see Figure 3A) corresponding to the profile probability distributions depicted in Figures 1(A) to 1(E) and assigned all of those profiles to their proper patterns (15 profiles each) with 100% accuracy. The profiles from the distribution shown in Figure 1(F), which were generated to represent a nonresponsive gene expression pattern, were left uncategorized. In contrast, CLICK with the default homogeneity setting returned only 3 clusters, with centroid patterns as shown in Figure 3B, and omitted the cluster from the distribution shown in Figure 1(E), which had all inter-group mean values at 3. CLICK merged the profiles from the distributions of Figure 1(C) and 1(D), despite their peaks at distinct data points.
In addition, CLICK assigned 32 profiles (two of them from distribution E in Table 1 and Figure 1(E)) to its Cluster 1 and 16 profiles (one of them also from distribution E) to its Cluster 2. We also varied the homogeneity setting. With homogeneity settings of 0.83 or 0.84, CLICK generated the highest overall average homogeneity within the patterns and produced essentially the same three clusters as with the default setting (data not shown).
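
The simulation and PCA described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the mean values in `MEANS` and the choice of five data points are hypothetical stand-ins for Table 1 (which is not reproduced here), and the function names `simulate_profiles` and `pca` are invented for this example. Only the standard deviation of 0.4, the 15 profiles per distribution, and the qualitative shapes of distributions A to F come from the text.

```python
import numpy as np

# Hypothetical mean profiles over five data points. The real values are in
# Table 1 of the paper; only the qualitative shapes below follow the text.
MEANS = {
    "A": [0.0, 1.0, 2.0, 3.0, 4.0],  # monotonically up
    "B": [4.0, 3.0, 2.0, 1.0, 0.0],  # monotonically down
    "C": [0.0, 1.5, 3.0, 1.5, 0.0],  # starts/ends at zero, peaks at 3rd point
    "D": [0.0, 3.0, 1.5, 0.5, 0.0],  # starts/ends at zero, peaks at 2nd point
    "E": [3.0, 3.0, 3.0, 3.0, 3.0],  # flat at 3
    "F": [0.0, 0.0, 0.0, 0.0, 0.0],  # flat at 0 (nonresponsive)
}


def simulate_profiles(means, n_per_group=15, sd=0.4, seed=0):
    """Draw normal deviates around each mean profile with a constant sd."""
    rng = np.random.default_rng(seed)
    rows, labels = [], []
    for name, mu in means.items():
        mu = np.asarray(mu, dtype=float)
        # Each row is one simulated profile: mean vector plus Gaussian noise.
        rows.append(rng.normal(loc=mu, scale=sd, size=(n_per_group, mu.size)))
        labels.extend([name] * n_per_group)
    return np.vstack(rows), np.array(labels)


def pca(data, n_components=3):
    """PCA via SVD of the column-centered data matrix."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # Fraction of total variance carried by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


profiles, labels = simulate_profiles(MEANS)
scores, explained = pca(profiles)
print("profiles:", profiles.shape, "PC scores:", scores.shape)
print("fraction of variance in first 3 PCs:", round(float(explained.sum()), 3))
```

Because the mean values here are placeholders, the fraction of variance captured by the first three components will not necessarily match the 84.3% reported for Figure 2. Plotting `scores` colored by `labels` would give a view comparable in spirit to that figure.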

