Identification of co-occurring transcription factor binding sites from DNA sequence using clustered position weight matrices.

Oh YM, Kim JK, Choi S, Yoo JY - Nucleic Acids Res. (2011)

Bottom Line: The Z-scores of the area under a receiver operating characteristic curve (AUC) values of 368 TFs were calculated and used to statistically identify co-occurring regulatory motifs in the TF bound ChIP loci. Motifs that are co-occurring along with the empirical bindings of E2F, JUN or MYC have been evaluated, in the basal or stimulated condition. Results prove our method can be useful to systematically identify the co-occurring motifs of the TF for the given conditions.


Affiliation: Department of Life Sciences, Pohang University of Science and Technology, Pohang, Republic of Korea.

ABSTRACT
Accurate prediction of transcription factor binding sites (TFBSs) is a prerequisite for identifying cis-regulatory modules that underlie transcriptional regulatory circuits encoded in the genome. Here, we present a computational framework for detecting TFBSs, when multiple position weight matrices (PWMs) for a transcription factor are available. Grouping multiple PWMs of a transcription factor (TF) based on their sequence similarity improves the specificity of TFBS prediction, which was evaluated using multiple genome-wide ChIP-Seq data sets from 26 TFs. The Z-scores of the area under a receiver operating characteristic curve (AUC) values of 368 TFs were calculated and used to statistically identify co-occurring regulatory motifs in the TF bound ChIP loci. Motifs that are co-occurring along with the empirical bindings of E2F, JUN or MYC have been evaluated, in the basal or stimulated condition. Results prove our method can be useful to systematically identify the co-occurring motifs of the TF for the given conditions.


Conversion of AUC to Z-scores. (A) Scatter plot showing the positive correlation between the mean AUC values and the mean GC content of PWM clusters (PWM cluster type 2). For each PWM cluster, the mean value of the AUC is taken over the 51 ChIP-Seq data sets. The mean GC content of a PWM cluster is computed by averaging the sum of the frequency of ‘G’ and ‘C’ over all the PWMs belonging to the PWM cluster. Red is the best-fit linear regression line with a positive slope. (B) Cluster analysis of 368 PWM clusters (PWM cluster type 2) from 51 ChIP-Seq data sets using AUC values. (C) Scatter plot of the mean Z-score and the mean GC content of PWM clusters. (D) Cluster analysis of 368 PWM clusters (PWM cluster type 2) from 51 ChIP-Seq data sets using Z-scores.
© Copyright Policy - creative-commons
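
The mean GC content described for panel A can be sketched as follows. This is only an illustration under stated assumptions: the caption does not say how G and C frequencies are aggregated within a single PWM, so the sketch assumes each PWM is a list of per-position frequency dictionaries and averages G + C over positions before averaging over the PWMs in a cluster. The function names `pwm_gc_content` and `cluster_mean_gc` and the toy matrices are hypothetical, not taken from the paper.

```python
def pwm_gc_content(pwm):
    """Mean of (freq G + freq C) over the positions of one PWM.

    Assumption: pwm is a list of columns, each a dict mapping
    'A', 'C', 'G', 'T' to frequencies that sum to 1.
    """
    return sum(col["G"] + col["C"] for col in pwm) / len(pwm)


def cluster_mean_gc(pwms):
    """Mean GC content over all PWMs belonging to one PWM cluster."""
    return sum(pwm_gc_content(p) for p in pwms) / len(pwms)


# Toy PWMs (made-up numbers, for illustration only).
pwm_a = [
    {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
pwm_b = [
    {"A": 0.50, "C": 0.00, "G": 0.00, "T": 0.50},
]

cluster_gc = cluster_mean_gc([pwm_a, pwm_b])
print(f"mean GC content of toy cluster: {cluster_gc:.3f}")
```

A different but equally plausible reading would pool all positions of all PWMs before averaging; the two readings differ when PWMs in a cluster have different lengths.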

Mentions: If TF-Y binds DNA in a sequence-specific manner and acts cooperatively with TF-X, then a statistically significant portion of the binding sites for TF-Y will appear in the binding regions of TF-X. Based on this assumption, we searched for binding sites that frequently co-occurred in the ChIP-Seq data set of TF-X. For this purpose, the AUC values for 368 TFs in the TF-PWM library were first computed using 51 ChIP-Seq data sets. However, due to the compositional bias in the GC content of regulatory regions (48), PWM clusters with a higher mean GC percentage show systematically higher AUC values (Figure 4A). As a result, PWMs with high GC content were found to be highly enriched for most of the ChIP-Seq data sets (Figure 4B). To solve this problem, the AUC values were converted to Z-scores, with the sample mean and standard deviation of the AUC values computed for each TF across all 51 ChIP-Seq data sets. The Z-score was not affected by the GC content of the PWM clusters (Figure 4C and D).
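
The per-TF conversion of AUC values to Z-scores can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the AUC values are held in a dictionary keyed by TF (PWM cluster) with one value per ChIP-Seq data set, and it assumes the "sample" standard deviation uses the n - 1 denominator, which the text does not specify. The name `auc_to_zscores` and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def auc_to_zscores(auc_by_tf):
    """Standardize each TF's AUC values across data sets.

    auc_by_tf maps a TF (PWM cluster) name to a list of AUC values,
    one per ChIP-Seq data set. The mean and standard deviation are
    computed per TF, so a TF that scores uniformly high everywhere
    (for example because of GC content) is centred on zero.
    """
    zscores = {}
    for tf, aucs in auc_by_tf.items():
        mu = mean(aucs)
        sd = stdev(aucs)  # assumption: n - 1 denominator
        if sd == 0:
            # All AUC values identical: no data set stands out.
            zscores[tf] = [0.0 for _ in aucs]
        else:
            zscores[tf] = [(a - mu) / sd for a in aucs]
    return zscores


# Toy example (made-up values): the GC-rich cluster has a higher
# baseline AUC, but both clusters deviate from their own baseline
# by the same amounts in each data set.
toy = {
    "GC_rich_cluster": [0.80, 0.82, 0.78, 0.90],
    "AT_rich_cluster": [0.50, 0.52, 0.48, 0.60],
}
z = auc_to_zscores(toy)
for tf, values in z.items():
    print(tf, [round(v, 2) for v in values])
```

In the toy data the two clusters receive the same Z-scores despite their different baseline AUC values, which is the effect the conversion is meant to achieve: enrichment is judged relative to each TF's own distribution across data sets.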

