NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data.

Kim NK, Jayatillake RV, Spouge JL - BMC Genomics (2013)

Bottom Line: The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.


Affiliation: Mathematics and Statistics Department, Old Dominion University, Norfolk, VA 23529, USA. nxkim@odu.edu

ABSTRACT

Background: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies.

Results: We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region.

Conclusions: The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.

Figure 3: A plot of percentage of top peaks with motif. (a) STAT1. (b) NRSF. (c) ZNF143. Some curves were truncated, because QuEST called fewer than 3,000 peaks; and WTD and MTC, fewer than 4,000 peaks. (Large percentages are desirable.)

Mentions: The previous section gives performance measures based on the top 2,000 peaks called by each program. Researchers might wish to compare the measures based on lists of top peaks with different truncations, e.g., lists truncated at rank 1,000, 4,000, or 10,000. Figure 3 shows the precision (fraction of TPs among the top peaks) for top peaks truncated at ranks up to 10,000. For each rank r on the x-axis, the precision is computed over the cumulative peaks from rank 1 to rank r. As expected, precision generally decreased with the length of the list. For STAT1, NEXT-peak had the largest precision (of all programs examined) over the full range of lengths, up to 10,000 peaks. For NRSF, NEXT-peak had nearly the best precision up to 4,500 peaks (and in fact, it had the best precision at 2,000 peaks; see the previous section); NEXT-peak had the best precision between 4,500 and 10,000 peaks. For ZNF143, NEXT-peak had near the best precision up to 10,000 peaks. For ZNF143, MACS performed similarly to NEXT-peak between 1,500 and 10,000 peaks, but MACS performed significantly worse than the other programs between 0 and 1,500 peaks.
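
The cumulative precision described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `precision_at_ranks`, the boolean input, and the toy ranked list are all hypothetical, and it assumes that a peak counts as a true positive when it contains the motif, as the figure caption suggests.

```python
from itertools import accumulate


def precision_at_ranks(has_motif):
    """Return the cumulative precision for every rank r = 1..len(has_motif).

    has_motif[i] is True when the peak at rank i + 1 is treated as a true
    positive (assumed here to mean that the peak contains the motif).
    """
    # Running count of true positives among the peaks ranked 1..r.
    hits = accumulate(int(flag) for flag in has_motif)
    # Precision at rank r = (true positives in ranks 1..r) / r.
    return [h / r for r, h in enumerate(hits, start=1)]


# Hypothetical ranked peak list, best-scoring peak first (not real data).
ranked = [True, True, False, True, False]
curve = precision_at_ranks(ranked)
```

Plotting `curve` against rank for each program would give one line per program, which is the kind of comparison the figure makes; truncating a list at rank r simply means reading off `curve[r - 1]`.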

