Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome.

Kuznetsov VA, Singh O, Jenjaroenpun P - BMC Genomics (2010)

Bottom Line: Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by ChIP-based methods. Thus the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.


Affiliation: Department of Genome and Gene Expression Data Analysis, Bioinformatics Institute, 30 Biopolis str, Singapore. vladimirk@bii.a-star.edu.sg

ABSTRACT

Background: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from bias in the information about binding loci availability, sample incompleteness and diverse sources of technical and biological noise. Therefore adequate mathematical models of ChIP-based high-throughput assay(s) and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of the TFBS avidity distribution and a plausible estimation of the total number of specific TFBS for a given TF in the genome for a given cell type.

Results: We developed an exploratory mixture probabilistic model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are defined by a mixture of sample size-dependent skewed functions, described by the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and an exponential function, respectively. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) not identified in the current experiments in the genome of mouse embryonic stem cells. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.
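
As a reading aid, the two-component mixture described in the Results can be written schematically as below. The notation is an assumption introduced here and does not come from the article: m is the number of overlapping ChIP-seq fragments at a locus, \pi is the fraction of specific loci, and \lambda is the decay rate of the non-specific component. The K-W component is left as a named placeholder because its explicit form is not given in this abstract.

```latex
% Schematic mixture for the observed peak-height distribution (notation assumed)
p(m) \;=\; \pi \, p_{\mathrm{KW}}(m) \;+\; (1-\pi)\, p_{\mathrm{exp}}(m),
\qquad
p_{\mathrm{exp}}(m) \;\propto\; e^{-\lambda m},
\qquad m = 1, 2, \dots
```

Under this reading, the specific component dominates at large m (high-avidity loci) and the exponential component dominates at small m, which is where specific and non-specific loci are hardest to separate.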

Conclusion: We provide a novel methodology for estimating the specificity and the sensitivity of TF-DNA binding in the massively parallel ChIP sequencing (ChIP-seq) binding assay. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by ChIP-based methods. Thus the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.



Figure 5: Validation of ChIP-seq-defined c-Myc binding loci based on motif-finding analysis. A: PWM of c-Myc TFBSs defined with the NestedMICA program, trained on loci with a ChIP-seq peak height of 12 or higher. B: Distribution of E-box sequences within ± 1 kb of the centre of ChIP-seq-defined binding loci. C: Frequency distribution of the number of overlapping ChIP-seq DNA fragments (peak height). ◊: all ChIP-seq c-Myc-bound loci at the observed peak heights. o: E-box-positive loci with an E-box within ± 250 bp. ∇: E-box-positive loci with an E-box within ± 150 bp. D: Venn diagram of co-occurrence of E-boxes within ± 150 bp of c-Myc binding loci (left). Pairwise kappa correlation coefficients of co-occurrence of E-boxes in c-Myc binding loci (right).

Mentions: According to the prediction of the GDP and K-W models, the number of low- and moderate-avidity TFBSs should increase monotonically as the number of unique overlapping ChIP-seq DNA fragments in a binding locus becomes smaller. To validate this prediction, we used the motif-discovery program nminfer (see Methods section) to identify c-Myc binding motifs/E-boxes in the ChIP-seq binding loci. As a training set we used the loci at or above the c-Myc-DNA binding specificity cut-off defined by the GDP model (≥12 sequences per locus), which represent high-avidity loci. We found a Position Weight Matrix (PWM) which contains the 'canonical' E-box CACGTG and the 'non-canonical' E-boxes CACGCG, CGCGAG and CACATG [18], where CGCGAG is a novel non-canonical E-box found in our study (Figure 5A). All four motifs were scanned for exact matches in the mouse genome (mm8) using SeqMap software [21]. The total numbers of matches of CACGTG, CACGCG, CGCGAG and CACATG in the mouse genome were 262,133, 83,490, 41,394 and 2,451,550, respectively. The E-box sequence CACATG occurs very frequently in non-genic low-complexity regions of the mouse genome associated with repeat elements or promiscuous genome regions. To minimize false positives and bias in the results, we excluded the E-box CACATG from our validation and prediction analyses. We then studied the localization of the E-boxes CACGTG, CACGCG and CGCGAG within ChIP-seq binding loci. To do so, we constructed the frequency distribution of the E-box sequences around the central nucleotide of ChIP-seq-defined c-Myc binding loci (Figure 5B). To identify the region of E-box localization within binding loci, we narrowed the scan region to ± 150 bp (300 bp) and ± 250 bp (500 bp) around the centre of the c-Myc binding locus (Figure 5B).
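
The exact-match scan and the windowing around the locus centre described above can be illustrated with a minimal Python sketch. It is not the SeqMap-based pipeline used in the article: the function names, the both-strand search, the 0-based offset convention and the 50 bp bin size are all assumptions made for illustration.

```python
from collections import Counter

# E-box hexamers named in the text; CACATG is left out, as in the text.
E_BOXES = ("CACGTG", "CACGCG", "CGCGAG")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_ebox_offsets(sequence, center, motifs=E_BOXES):
    """Return (motif, offset) pairs for exact matches on either strand.

    offset is the match start minus `center` (both 0-based), so negative
    values lie upstream of the locus centre. Hypothetical helper.
    """
    sequence = sequence.upper()
    hits = []
    for motif in motifs:
        # A palindromic motif such as CACGTG equals its reverse complement;
        # the set keeps it from being searched (and counted) twice.
        for pattern in {motif, reverse_complement(motif)}:
            start = sequence.find(pattern)
            while start != -1:
                hits.append((motif, start - center))
                start = sequence.find(pattern, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


def has_ebox_within(hits, half_window):
    """True if any match starts within +/- half_window bp of the centre."""
    return any(abs(offset) <= half_window for _, offset in hits)


def offset_histogram(hits, bin_size=50):
    """Count matches per offset bin, keyed by the lower edge of each bin."""
    return Counter((offset // bin_size) * bin_size for _, offset in hits)
```

Calling `has_ebox_within(hits, 150)` and `has_ebox_within(hits, 250)` on each locus would give the two E-box-positive classes, and summing `offset_histogram` over loci would give a distribution of the kind shown in Figure 5B.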

