Design and analysis of ChIP-seq experiments for DNA-binding proteins.

Kharchenko PV, Tolstorukov MY, Park PJ - Nat. Biotechnol. (2008)

Bottom Line: To fill this gap, we propose an analysis pipeline specifically designed to detect protein-binding positions with high accuracy. Using previously reported data sets for three transcription factors, we illustrate methods for improving tag alignment and correcting for background signals. We also analyze the relationship between the depth of sequencing and characteristics of the detected binding positions, and provide a method for estimating the sequencing depth necessary for a desired coverage of protein binding sites.


Affiliation: Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck St., Boston, Massachusetts 02115, USA.

ABSTRACT
Recent progress in massively parallel sequencing platforms has enabled genome-wide characterization of DNA-associated proteins using the combination of chromatin immunoprecipitation and sequencing (ChIP-seq). Although a variety of methods exist for analysis of the established alternative ChIP microarray (ChIP-chip), few approaches have been described for processing ChIP-seq data. To fill this gap, we propose an analysis pipeline specifically designed to detect protein-binding positions with high accuracy. Using previously reported data sets for three transcription factors, we illustrate methods for improving tag alignment and correcting for background signals. We compare the sensitivity and spatial precision of three peak detection algorithms with published methods, demonstrating gains in spatial precision when an asymmetric distribution of tags on positive and negative strands is considered. We also analyze the relationship between the depth of sequencing and characteristics of the detected binding positions, and provide a method for estimating the sequencing depth necessary for a desired coverage of protein binding sites.

Figure 1: a. Main steps of the proposed ChIP-seq processing pipeline. b. A schematic illustration of ChIP-seq measurements. DNA is fragmented or digested, and fragments cross-linked to the protein of interest are selected with IP. The 5’ ends (squares) of the selected fragments are sequenced, typically forming groups of positive- and negative-strand tags on the two sides of the protected region. The dashed red line illustrates a fragment generated from a long cross-link that may account for the tag patterns observed in the CTCF and STAT1 datasets. c. Tag distribution around a stable NRSF binding position. Vertical lines show the number of tags (right axis) whose 5’ position maps to a given location on the positive (red) or negative (blue) strand. Positive and negative values on the y-axis are used to illustrate tags mapping to the positive and negative strands, respectively. The solid curves show tag density for each strand (left axis, based on a Gaussian kernel with σ = 15 bp). d. Strand cross-correlation for the NRSF data. The y-axis shows the Pearson linear correlation coefficient between genome-wide profiles of tag density of the positive and negative strands, shifted relative to each other by the distance specified on the x-axis. The peak position (red vertical line) indicates a typical distance separating positive- and negative-strand peaks associated with the stable binding positions.
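The strand cross-correlation described in panel d can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the profiles are raw per-base tag counts rather than Gaussian-smoothed densities, and the tags are simulated under the assumption that positive- and negative-strand tags sit about 100 bp apart around each binding position.

```python
import math
import random


def strand_profile(positions, length):
    """Per-base count of tag 5' ends (hypothetical helper, no smoothing)."""
    counts = [0.0] * length
    for p in positions:
        if 0 <= p < length:
            counts[p] += 1.0
    return counts


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus[i] with minus[i + shift] for each shift in 0..max_shift."""
    return {
        s: pearson(plus[: len(plus) - s], minus[s:])
        for s in range(max_shift + 1)
    }


def simulate_tags(length=10000, n_sites=30, separation=100,
                  tags_per_strand=20, seed=0):
    """Simulated data (an assumption): tags flank each site by separation/2."""
    rng = random.Random(seed)
    plus, minus = [], []
    for _ in range(n_sites):
        site = rng.randrange(500, length - 500)
        for _ in range(tags_per_strand):
            plus.append(int(round(site - separation / 2 + rng.gauss(0, 5))))
            minus.append(int(round(site + separation / 2 + rng.gauss(0, 5))))
    return plus, minus


length = 10000
plus_tags, minus_tags = simulate_tags(length=length)
cc = strand_cross_correlation(
    strand_profile(plus_tags, length),
    strand_profile(minus_tags, length),
    max_shift=200,
)
# The shift with the highest correlation estimates the typical strand separation.
best_shift = max(cc, key=cc.get)
print("shift with maximal cross-correlation:", best_shift)
```

On real data the profiles would span whole chromosomes, so a vectorized implementation would likely be needed; the pure-Python loops here are only meant to make the definition concrete.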

Mentions: Here we describe a data processing pipeline optimized for detection of localized protein binding positions from unpaired sequence reads (Figure 1a). We illustrate the proposed pipeline on datasets for genome-wide binding of NRSF [2], CTCF [13] and STAT1 [11], produced using the Solexa platform. The alignment procedure is enhanced to maximize the number of informative tags, based on the strand-specific pattern of tag distribution expected around a binding position. Filtering and background-correction steps are used to lower false-positive rates. We compare the performance of several novel and previously described computational methods for calling specific binding positions, and show that some methods provide higher specificity and position accuracy. The final step of the proposed pipeline examines the saturation level of detected binding positions to determine the amount of additional sequencing that may be necessary.
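One way to picture the saturation step is to subsample the tags, call binding positions on each subsample, and ask what fraction of the full-data calls is recovered. The Python sketch below illustrates that general idea only; it is not the procedure used in the paper. The fixed-window caller, its threshold, the function names, and the simulated tags are all assumptions made for illustration.

```python
import random


def call_sites(tags, window=200, min_tags=8):
    """Toy caller (hypothetical): report windows holding at least min_tags tags."""
    counts = {}
    for t in tags:
        b = t // window
        counts[b] = counts.get(b, 0) + 1
    return {b * window for b, c in counts.items() if c >= min_tags}


def saturation_curve(tags, fractions, rng):
    """For each fraction of tags, the share of full-data sites still detected."""
    full = call_sites(tags)
    curve = []
    for f in fractions:
        subsample = rng.sample(tags, int(len(tags) * f))
        found = call_sites(subsample)
        recovered = len(found & full) / len(full) if full else 0.0
        curve.append((f, recovered))
    return curve


def simulate_tags(rng, n_sites=50, window=200, genome=1_000_000,
                  background=2000):
    """Simulated data (an assumption): uniform background plus sites of
    increasing strength, each centered inside one window."""
    tags = [rng.randrange(genome) for _ in range(background)]
    site_windows = rng.sample(range(genome // window), n_sites)
    for i, w in enumerate(site_windows):
        center = w * window + window // 2
        for _ in range(5 + i):  # site i receives 5 + i tags
            tags.append(center + rng.randint(-50, 50))
    return tags


rng = random.Random(1)
tags = simulate_tags(rng)
curve = saturation_curve(tags, [0.1, 0.25, 0.5, 0.75, 1.0], rng)
for fraction, recovered in curve:
    print(f"{fraction:.2f} of tags -> {recovered:.2f} of sites recovered")
```

If the recovered fraction is still climbing steeply as the subsample approaches the full dataset, weaker sites are probably still being missed and additional sequencing may be worthwhile; a curve that flattens early suggests the detected set is close to saturated at the chosen threshold.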

