SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps.

Setty M, Leslie CS - PLoS Comput. Biol. (2015)

Bottom Line: Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. In particular, SeqGL was able to identify a number of ChIP-seq validated sequence signals that were not found by traditional motif discovery algorithms.


Affiliation: Computational Biology Program, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America.

ABSTRACT
Genome-wide maps of transcription factor (TF) occupancy and regions of open chromatin implicitly contain DNA sequence signals for multiple factors. We present SeqGL, a novel de novo motif discovery algorithm to identify multiple TF sequence signals from ChIP-, DNase-, and ATAC-seq profiles. SeqGL trains a discriminative model using a k-mer feature representation together with group lasso regularization to extract a collection of sequence signals that distinguish peak sequences from flanking regions. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. SeqGL successfully scales to the large multiplicity of sequence signals in DNase- or ATAC-seq maps. In particular, SeqGL was able to identify a number of ChIP-seq validated sequence signals that were not found by traditional motif discovery algorithms. Thus compared to widely used motif discovery algorithms, SeqGL demonstrates both greater discriminative accuracy and higher sensitivity for detecting the DNA sequence signals underlying regulatory element maps. SeqGL is available at http://cbio.mskcc.org/public/Leslie/SeqGL/.
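The approach described in the abstract (k-mer features, a discriminative model, and group lasso regularization) can be illustrated with a minimal sketch. This is not the SeqGL implementation: the logistic loss, the proximal gradient solver, the dinucleotide features, the hand-made groups, and all function names below are assumptions chosen to keep the example small and dependency-free.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count every length-k substring of seq over the ACGT alphabet."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # k-mers containing N or other symbols are skipped
            counts[j] += 1.0
    return counts

def group_soft_threshold(w, groups, threshold):
    """Proximal operator of the group lasso penalty.

    Each group's weight vector is shrunk by `threshold` in L2 norm;
    groups whose norm does not exceed the threshold are set to zero.
    """
    out = list(w)
    for members in groups:
        norm = math.sqrt(sum(w[j] ** 2 for j in members))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for j in members:
            out[j] = w[j] * scale
    return out

def sigmoid(z):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def fit_group_lasso_logistic(X, y, groups, lam=0.1, step=0.1, n_iter=300):
    """Proximal gradient descent on mean logistic loss + lam * sum_g ||w_g||_2.

    X is a list of feature rows, y holds labels in {0, 1} (1 = peak, 0 = flank).
    No intercept is fitted, to keep the sketch short.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        stepped = [wj - step * gj for wj, gj in zip(w, grad)]
        w = group_soft_threshold(stepped, groups, step * lam)
    return w

# Toy data (hypothetical): GC-rich "peaks" versus AT-rich "flanks", k = 2.
peaks = ["GCGCGCGC", "GGCCGGCC"]
flanks = ["ATATATAT", "AATTAATT"]
kmers = ["".join(p) for p in product(ALPHABET, repeat=2)]
# Hand-made groups for illustration only: GC-only, AT-only, and mixed dinucleotides.
groups = [
    [i for i, km in enumerate(kmers) if set(km) <= set("GC")],
    [i for i, km in enumerate(kmers) if set(km) <= set("AT")],
    [i for i, km in enumerate(kmers) if set(km) & set("GC") and set(km) & set("AT")],
]
X = [kmer_counts(s, 2) for s in peaks + flanks]
y = [1, 1, 0, 0]
w = fit_group_lasso_logistic(X, y, groups)
```

On this toy input the GC-only group receives positive weights, the AT-only group negative weights, and the mixed group stays exactly at zero because none of its k-mers occur in the data. The point of the group penalty is that whole groups of k-mers are kept or dropped together, so each surviving group can be read as one candidate sequence signal.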



pcbi.1004271.g003: SeqGL identifies binding signals in ChIP-seq occupancy profiles. (A) Heatmaps show predicted binding signals and ChIP occupancy of PAX5 and co-factors. SeqGL analysis of PAX5 ChIP-seq predicts BATF and PU.1 as the most significant binding partners of PAX5. The top panel shows the group scores associated with three TFs from the PAX5 model, and the bottom panel shows the corresponding ChIP-seq read counts. This shows that a number of PAX5 ChIP-seq peaks are indirect and obtained through DNA binding of partners rather than PAX5 itself. The dashed boxes highlight the specific examples illustrated in Fig 3B. (B) Specific examples of PAX5 profiles show various modes of binding detected by SeqGL. The left panel shows direct binding of a TF (PAX5 recognizing its motif). The middle panel shows that the sequence signal is associated with BATF, and hence the PAX5 peak at this location is due to interaction of the two factors and/or long-distance looping. The right panel shows an example of co-binding of PAX5 and PU.1, each recognizing its respective binding motif.

Mentions: PAX5 is an important B cell lineage factor expressed at early stages of B cell differentiation [21]. We therefore used PAX5 ChIP-seq data in GM12878 to examine the co-factor binding profiles identified by SeqGL (Fig 3). SeqGL was run with 20 groups and identified 7 groups as significantly predictive of peaks compared to flanks. The top panel of Fig 3A shows the group scores for the three highest-scoring groups, ranked by their power to discriminate peaks from flanks, and the bottom panel shows the ChIP-seq read counts for the corresponding TFs. Although SeqGL identified 7 groups as predictive of PAX5 peaks, we highlight only the three highest-ranked groups for simplicity. As expected, the top-scoring group in the PAX5 ChIP-seq experiment identifies sites that are strongly associated with the canonical PAX5 motif. The other two groups are associated with AP-1 family and PU.1 motifs; these factors have prominent roles in B cell function [21]. We next used existing ChIP-seq data to validate these predictions (S2 Fig, Materials and Methods) and found that PAX5 peaks predicted by these two groups are indeed bound by AP-1 family factors and PU.1, respectively. Furthermore, we identified BATF as the specific AP-1 factor, since peaks associated with this group are most enriched for BATF ChIP-seq peaks. Interestingly, even though the PAX5 ChIP-seq read densities are uniform across all peaks, the group scores differ significantly, and a significant number of PAX5 peaks do not have a sequence signal for PAX5. We propose that this observation is due to different modes of binding. Fig 3B shows specific examples of these modes of binding. The left panel shows the direct binding mode: a TF recognizing its canonical motif.
The middle panel shows that even though there is a strong PAX5 peak, the sequence signal is actually derived from a different factor, BATF, indicating either indirect PAX5 binding via a protein-protein interaction or potentially a distal looping interaction. This illustrates that for a region with ChIP-seq peaks for multiple factors, the binding signals need only come from a subset of those factors. Finally, the right panel demonstrates co-binding of PAX5 and PU.1, with each factor recognizing its respective motif. These observations are consistent with the different modes of interaction between TFs identified by Wang et al. [8]. The fraction of peaks with non-canonical signal depends on the TF; we observed a continuous spectrum across ChIP-seq experiments, with some TFs showing exclusively canonical signals and others showing a mix of canonical and non-canonical signals (S3 Fig and S3 Table). We note that gkm-SVM has a procedure for producing PSSMs from the top-ranked k-mers in the model and reports three motifs per TF ChIP-seq experiment. However, when we compared its results with those of SeqGL and HOMER, we saw that gkm-SVM missed many of the co-factor signals identified by SeqGL and indeed often returned three variants of the same motif (S4 Table).
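The three binding modes described above can be made concrete with a small sketch of how per-group scores might be read off a fitted linear model and then checked against partner ChIP-seq peaks. Everything here is illustrative: the definition of a group score as the weighted sum of that group's k-mer features, the threshold, the group labels, the weights, the feature vectors, and the peak coordinates are all assumptions, and the enrichment analysis in the paper's Materials and Methods is not reproduced.

```python
def group_scores(x, w, groups):
    """Contribution of each k-mer group to the linear score of one peak."""
    return {name: sum(w[j] * x[j] for j in members)
            for name, members in groups.items()}

def active_groups(scores, min_score):
    """Names of the groups whose score exceeds min_score, in sorted order."""
    return sorted(name for name, s in scores.items() if s > min_score)

def overlap_fraction(peaks, partner_peaks):
    """Fraction of (start, end) intervals in `peaks` that overlap any
    interval in `partner_peaks` (same chromosome assumed)."""
    if not peaks:
        return 0.0
    hits = sum(
        any(s < p_end and p_start < e for p_start, p_end in partner_peaks)
        for s, e in peaks
    )
    return hits / len(peaks)

# Hypothetical model: six k-mer features split into three labeled groups.
groups = {"PAX5-like": [0, 1], "BATF-like": [2, 3], "PU.1-like": [4, 5]}
w = [1.0, 0.5, 1.0, 0.5, 1.0, 0.5]

# Toy feature vectors for three PAX5 ChIP-seq peaks, one per binding mode.
toy_peaks = {
    "direct": [2, 1, 0, 0, 0, 0],      # signal from the PAX5-like group only
    "via_partner": [0, 0, 2, 1, 0, 0],  # signal from the BATF-like group only
    "co_binding": [1, 1, 0, 0, 1, 1],   # PAX5-like and PU.1-like signals
}
modes = {
    name: active_groups(group_scores(x, w, groups), min_score=1.0)
    for name, x in toy_peaks.items()
}

# Toy validation: how many peaks assigned to the BATF-like group overlap
# (hypothetical) BATF ChIP-seq peaks?
batf_group_peaks = [(100, 200), (500, 600)]
batf_chip_peaks = [(150, 250), (900, 1000)]
batf_support = overlap_fraction(batf_group_peaks, batf_chip_peaks)
```

In this toy setting a peak whose only active group is the partner's corresponds to the middle panel of Fig 3B, and a peak with two active groups corresponds to the right panel. Comparing overlap fractions across candidate factors is one simple way to ask which partner a group most plausibly represents.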

