Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome.

Kuznetsov VA, Singh O, Jenjaroenpun P - BMC Genomics (2010)

Bottom Line: Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by the ChIP-based methods. Thus, the task of identifying the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared with former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.


Affiliation: Department of Genome and Gene Expression Data Analysis, Bioinformatics Institute, 30 Biopolis str, Singapore. vladimirk@bii.a-star.edu.sg

ABSTRACT

Background: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated by Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from a bias in the information about binding loci availability, sample incompleteness and diverse sources of technical and biological noise. Therefore, adequate mathematical models of ChIP-based high-throughput assay(s) and statistical tools are required for a robust identification of specific and reliable TF binding sites (TFBS), a precise characterization of the TFBS avidity distribution and a plausible estimation of the total number of specific TFBS for a given TF in the genome for a given cell type.

Results: We developed an exploratory mixture probabilistic model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are defined by a mixture of sample size-dependent skewed functions described by the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and an exponential function, respectively. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb) reported in Chen et al. (2008), we estimated (i) the specificity and the sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.

Conclusion: We provide a novel methodology for estimating the specificity and the sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by the ChIP-based methods. Thus, the task of identifying the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared with former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.



Figure 4: Three segments in the range of TF-DNA BEs count for 11 TFs of mouse E14 embryonic cells.

Mentions: Reliable data points in the right-tail region of the empirical frequency distribution (starting from our best-fit GDP-defined cut-off peak value; region N2 in Figure 2B, C, D) and the best-fit predicted GDP data points of the noise-rich (left side) region of the distribution (region N1, Figure 2B, Additional file 2) were combined. K-W function fitting was used to estimate the fraction of non-detected BEs, p0, and the total number of TFBSs, Ntot (Ntot = N0 + N1 + N2). For example, for the Nanog dataset, at threshold 11, 1.02 × 10^4 reliable BSs were predicted by the GDP and K-W functions. Additionally, the K-W function predicts a total of 51114 Nanog BSs in the genome of E14 cells, of which 10223 (20% of Ntot) were non-observed BSs (N0) in the library, 30678 (60% of Ntot) were specific BSs predicted in the noise-rich BE set (N1) and 10213 (20% of Ntot) were predicted in the reliable specific BE set (N2) (Figure 4). The estimates of the parameters of the GDP and K-W functions and Ntot for c-Myc, Esrrb and other TF data are given in Table 3 and, partially, in Figure 4.
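The Nanog arithmetic quoted above can be checked with a short sketch. The three counts are taken from the text; the function name `segment_fractions` and the dictionary keys are illustrative assumptions, and reading p0 as N0/Ntot is an interpretation of the text, not a formula given by the authors.

```python
# Nanog segment sizes as quoted in the text (E14 cells, threshold 11).
N0 = 10223  # non-observed specific BSs
N1 = 30678  # specific BSs predicted in the noise-rich BE set
N2 = 10213  # specific BSs in the reliable BE set


def segment_fractions(n0, n1, n2):
    """Return Ntot = N0 + N1 + N2 and each segment's share of Ntot.

    Hypothetical helper for illustration; it only re-derives the
    percentages reported in the text from the reported counts.
    """
    n_tot = n0 + n1 + n2
    shares = {
        "N0": n0 / n_tot,
        "N1": n1 / n_tot,
        "N2": n2 / n_tot,
    }
    return n_tot, shares


n_tot, shares = segment_fractions(N0, N1, N2)
print(n_tot)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these counts, only about one fifth of the predicted specific Nanog sites fall in the reliable set N2, which is the sense in which the text describes the remaining sites as undetected or hidden in the noise-rich region.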

