NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data.

Kim NK, Jayatillake RV, Spouge JL - BMC Genomics (2013)

Bottom Line: The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.

Affiliation: Mathematics and Statistics Department, Old Dominion University, Norfolk, VA 23529, USA. nxkim@odu.edu

ABSTRACT

Background: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies.

Results: We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region.

Conclusions: The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.

Figure 6: Smooth scatterplots of estimated standard deviation vs. actual distance to the nearest motif site. (a) STAT1. (b) NRSF. (c) ZNF143. Both axes are truncated at 250 bp. For all datasets, the estimated standard deviation and the actual distance to the nearest motif site had a positive Pearson correlation coefficient. Dark regions represent high densities of data points, and small dots represent isolated points. The majority of data points lie in the bottom-left corner of all three panels; hence most standard deviation estimates are small, and the actual distances to the motif site are also generally small.

Mentions: Unlike other programs, NEXT-peak indicates the accuracy of a peak’s estimated location by estimating the corresponding standard deviation. Figure 6 displays smooth scatterplots of the estimated standard deviation versus the distance to the nearest candidate site. (The plot is truncated on both axes at 250 bp, to dampen the influence of outliers caused by the omission of a true site among the candidate sites.) Dark regions represent high densities of data points, and small dots represent isolated points. Ideally, all points should fall near the line y = x. Most data points, however, are concentrated in the bottom-left corner of all three plots; that is, most standard deviation estimates are small, and the actual distances to the motif site tend to be small as well. The Pearson correlation coefficients corresponding to the smooth scatterplots are 0.43 (STAT1), 0.50 (NRSF), and 0.33 (ZNF143), indicating that, although the relationship is rather weak, the estimated error is positively correlated with the actual distance between the estimated peak location and the motif site.
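
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes that "truncated at 250 bp" means each value is clipped to at most 250 before the correlation is computed, and the function names (`truncate`, `pearson`) and the sample numbers are hypothetical, made up for the example rather than taken from the STAT1, NRSF, or ZNF143 datasets.

```python
import math

TRUNCATION_BP = 250.0  # truncation threshold quoted in the text


def truncate(values, cap=TRUNCATION_BP):
    """Clip each value at `cap` (assumed meaning of 'truncated at 250 bp')."""
    return [min(v, cap) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up values (bp), purely illustrative: one entry per called peak.
estimated_sd = [5.0, 8.0, 12.0, 20.0, 40.0, 300.0]
distance_to_motif = [3.0, 10.0, 6.0, 35.0, 25.0, 900.0]

# Clip both axes first, so a single far-away outlier cannot dominate.
r = pearson(truncate(estimated_sd), truncate(distance_to_motif))
print(f"Pearson r after truncation: {r:.2f}")
```

Clipping before correlating mirrors the stated motivation for the truncation: a peak whose true site is missing from the candidate list can sit arbitrarily far from its nearest candidate, and capping both axes limits how much such a point can sway the coefficient.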

