Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome.

Kuznetsov VA, Singh O, Jenjaroenpun P - BMC Genomics (2010)

Affiliation: Department of Genome and Gene Expression Data Analysis, Bioinformatics Institute, 30 Biopolis str, Singapore. vladimirk@bii.a-star.edu.sg

ABSTRACT

Background: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from bias in the information about binding loci availability, sample incompleteness, and diverse sources of technical and biological noise. Therefore, adequate mathematical models of ChIP-based high-throughput assays and statistical tools are required for robust identification of specific and reliable TF binding sites (TFBSs), precise characterization of the TFBS avidity distribution, and plausible estimation of the total number of specific TFBSs for a given TF in the genome of a given cell type.

Results: We developed an exploratory probabilistic mixture model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are defined by a mixture of sample size-dependent skewed functions, described by the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and an exponential function, respectively. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb), reported in Chen et al. (2008), we estimated (i) the specificity and the sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif-finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.
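
The two-component idea can be illustrated with a minimal sketch. It is not the authors' model: the exact form of the K-W function is not given in this section, so a generic heavy-tailed distribution stands in for the specific-binding component, and all parameter values, function names and the threshold-based definitions of sensitivity and specificity are assumptions chosen only for illustration.

```python
import math

def normalized(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def exponential_pmf(rate, max_height):
    """Discrete exponential decay over peak heights 1..max_height
    (non-specific binding component)."""
    return normalized([math.exp(-rate * m) for m in range(1, max_height + 1)])

def skewed_pmf(shape, shift, max_height):
    """Heavy-tailed stand-in for the specific-binding component.
    ASSUMPTION: this is NOT the K-W function, only a placeholder."""
    return normalized([(m + shift) ** -shape for m in range(1, max_height + 1)])

def posterior_specific(height, w_specific, specific, noise):
    """Probability that a peak of the given height comes from the specific
    component, for a hypothetical mixture weight w_specific."""
    s = w_specific * specific[height - 1]
    n = (1.0 - w_specific) * noise[height - 1]
    return s / (s + n)

def sensitivity_specificity(threshold, specific, noise):
    """Call every peak with height >= threshold 'specific'.
    Sensitivity: fraction of specific sites that are called.
    Specificity: fraction of non-specific sites that are rejected."""
    sensitivity = sum(specific[threshold - 1:])
    specificity = sum(noise[:threshold - 1])
    return sensitivity, specificity

# Hypothetical parameters, chosen only to make the example run.
noise = exponential_pmf(rate=1.0, max_height=100)
specific = skewed_pmf(shape=2.0, shift=1.0, max_height=100)
for height in (1, 5, 10):
    print(height, round(posterior_specific(height, 0.3, specific, noise), 3))
print(sensitivity_specificity(5, specific, noise))
```

Under these assumptions, raising the height threshold trades sensitivity for specificity, which is the qualitative reason low-avidity specific sites are hard to separate from non-specific signal.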

Conclusion: We provide a novel methodology for estimating the specificity and the sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by ChIP-based methods. Thus, the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.

Figure 6: Multiple occurrences of c-Myc E-boxes in the promoter region around the transcription start site (TSS) of c-Myc target genes identified in the ChIP-seq experiment. A: WEE1 homolog 1 (Wee1). The high-avidity (peak height = 76) c-Myc binding site (indicated by the blue arrow) in the strong promoter region of the Wee1 gene is supported by two canonical E-boxes (CACGTG). The binding locus and the E-boxes overlap with a CpG island, which might be bound by c-Myc. B: Nucleoplasmin 3 (Npm3). As in the previous case, the moderate-avidity (peak height = 10) ChIP-seq c-Myc binding locus is located in the first-intron promoter region. The locus is supported by five E-boxes: one canonical and three non-canonical E-boxes in the first intron, and another canonical E-box in the second exon. In addition, this binding region and the E-boxes are located in a CpG island. C: FK506 binding protein 5 (Fkbp5). Two relatively low-avidity c-Myc binding sites identified in the ChIP-seq experiment are confirmed by E-boxes. The first ChIP-seq locus, in the upstream gene region, is a relatively low-avidity binding site (peak height = 7) supported by a canonical E-box (CACGTG). The second ChIP-seq locus, located in the first intron, is also a relatively low-avidity peak (peak height = 7) and is supported by two non-canonical CACGCG E-boxes and two non-canonical CGCGAG E-boxes. The last locus overlaps with a CpG island, which suggests that this locus might be functional.

Mentions: Figure 6A presents a genomic map of the distribution of c-Myc E-boxes in the WEE1 homolog 1 (Wee1) gene region. The ChIP-seq c-Myc binding locus (indicated by the blue arrow) should be considered a high-avidity TFBS because its peak height is 76. This binding locus contains two canonical E-box sequences (CACGTG), shown in Figure 6A. Wee1 was reported to be an important component of the hESC cell cycle regulation pathway [22] and might be co-regulated with c-Myc [23]. Figure 6B presents a genomic region of Nucleoplasmin 3 (Npm3), a chromatin remodelling protein responsible for the unique chromatin structure and replicative capacity of ES cells [24]. A moderate-avidity c-Myc BS (peak height = 10) is located in the region near the first intron. The locus contains five E-box sequences: one canonical and three non-canonical E-boxes in the first intron, and another canonical E-box in the second exon. In addition, this BS and the E-boxes overlap with a CpG island. Figure 6C shows low-avidity c-Myc BSs within the promoter region of FKBP5 (FK506 binding protein 5). Two relatively low-avidity c-Myc BSs contain c-Myc E-boxes. The first BS, located in the upstream region of the gene, has a relatively low avidity (overlap peak value 7) and contains a canonical E-box (CACGTG). The second BS, located in the first intron, also has a relatively low avidity (overlap peak value 7) and contains two non-canonical CACGCG E-boxes and two non-canonical CGCGAG E-boxes. The last BS overlaps with a CpG island, which suggests that this locus might be related to the regulation of transcription of the gene.
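
The kind of E-box annotation described for Figure 6 can be sketched as a simple exact-match scan of a sequence on both strands. This is only an illustration, not the motif-finding procedure used in the study: the function names and the example sequence are hypothetical, and only the three hexamers named in the text (CACGTG, CACGCG, CGCGAG) are searched.

```python
import re

# E-box hexamers named in the Figure 6 text; the labels follow the text.
EBOX_MOTIFS = {
    "CACGTG": "canonical",
    "CACGCG": "non-canonical",
    "CGCGAG": "non-canonical",
}

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_eboxes(seq):
    """Return sorted (start, motif, label, strand) hits.
    Starts are 0-based positions on the forward strand."""
    seq = seq.upper()
    hits = set()
    for motif, label in EBOX_MOTIFS.items():
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            # A palindromic motif (CACGTG) reads the same on both strands,
            # so it is reported once, on the "+" strand.
            if strand == "-" and pattern == motif:
                continue
            # The lookahead lets overlapping occurrences be found.
            for match in re.finditer(f"(?={pattern})", seq):
                hits.add((match.start(), motif, label, strand))
    return sorted(hits)

# Made-up sequence, not taken from the Wee1, Npm3 or Fkbp5 loci.
for hit in find_eboxes("TTCACGTGAACACGCGTTCTCGCGAA"):
    print(hit)
```

Exact matching keeps the sketch short; a real analysis would typically also check each hit against peak coordinates and CpG island annotation, as the figure does.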

