NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data.

Kim NK, Jayatillake RV, Spouge JL - BMC Genomics (2013)

Bottom Line: The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.


Affiliation: Mathematics and Statistics Department, Old Dominion University, Norfolk, VA 23529, USA. nxkim@odu.edu

ABSTRACT

Background: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies.

Results: We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region.

Conclusions: The NEXT-peak program calls peaks on any test dataset about as accurately as any other program, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.


© Copyright Policy - open-access

Figure 8: A flowchart for the NEXT-peak algorithm. First, the program reads the mapped tags. Then, it selects regions based on the tag count, using a user-specified window length (default: 150) and a user-specified minimum count (default: 15). If motif site locations are provided, it estimates σ and β from them. For an unknown motif, run the NEXT-peak program with the default values (σ = 30 and β = 50) and identify the strongest motif from a motif search. For each region, the program estimates μ and ν and computes standard-deviation estimates for these estimates. As a post-processing step, when motif site information is available, the program computes recommended region-length and p-value cut-offs to screen out potentially spurious regions.

Mentions: The NEXT-peak program produces its output through the following steps (Figure 8 shows a flowchart for the program). (1) Read the mapped tag location file, e.g., Bowtie [17] output. (2) Select regions based on the tag count, using a user-specified window length (default: 150) and a user-specified minimum count (default: 15). When a neighboring window has more than the minimum count, the window under scrutiny is combined with that neighbor, so region lengths range from the minimum length (default: 150) to several thousand. (3) When motif site locations are available, estimate σ and β by maximizing the likelihood using those locations. For a known TF, a publicly available motif pattern is used, e.g., from JASPAR [16]. For an unknown motif, run the NEXT-peak program with the default values (σ = 30 and β = 50) and identify the strongest motif from a motif search. Alternatively, a user can estimate these parameters from selected regions. (4) For each region, estimate μ and ν by maximizing the likelihood, compute standard-deviation estimates for these estimates, and then perform a goodness-of-fit test. (5) As a post-processing step, when motif site locations are available, compute recommended region-length and p-value cut-offs to screen out potentially spurious regions.
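
The region-selection step (2) can be illustrated with a minimal sketch. This is not the NEXT-peak implementation: the function name `select_regions`, the use of fixed, non-overlapping windows, the treatment of the minimum count as an inclusive threshold, and the restriction to tag positions on a single chromosome are all assumptions made for illustration. Only the defaults (window length 150, minimum count 15) and the merging of neighboring windows come from the description above.

```python
from collections import Counter


def select_regions(tag_positions, window=150, min_count=15):
    """Return half-open (start, end) regions built from dense windows.

    Assumptions (not taken from the NEXT-peak source code):
    - windows are fixed and non-overlapping, starting at position 0;
    - a window qualifies when it holds at least `min_count` tags;
    - all positions lie on one chromosome.
    """
    # Count tags per window, indexing each window by integer division.
    counts = Counter(pos // window for pos in tag_positions)

    # Keep only the windows that reach the minimum tag count, in order.
    dense = sorted(w for w, c in counts.items() if c >= min_count)

    regions = []
    for w in dense:
        start, end = w * window, (w + 1) * window
        if regions and regions[-1][1] == start:
            # The previous qualifying window is adjacent: extend that region.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical tag positions: two adjacent dense windows, one isolated
    # dense window, and one sparse window that is discarded.
    tags = [10] * 15 + [160] * 20 + [700] * 15 + [1000] * 3
    print(select_regions(tags))  # [(0, 300), (600, 750)]
```

Under these assumptions every region length is a multiple of the window length, which matches the statement that regions range from the minimum length upward as neighboring dense windows are combined.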


