Functional characterization of transcription factor motifs using cross-species comparison across large evolutionary distances.

Kim J, Cunningham R, James B, Wyder S, Gibson JD, Niehuis O, Zdobnov EM, Robertson HM, Robinson GE, Werren JH, Sinha S - PLoS Comput. Biol. (2010)

Bottom Line: We develop a computational framework for this task, whose features include a new statistical score for motif scanning, the use of different scores for predicting targets of different motifs, and new ways to deal with redundancies among significant motif-function associations. It is therefore well suited for comparative genomics across large evolutionary divergences, where existing alignment-based methods are not applicable. We also apply the framework to find motifs associated with socially regulated gene sets in the honeybee, Apis mellifera, using comparisons with Nasonia, a solitary species, to identify honeybee-specific associations.

Affiliation: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.

ABSTRACT
We address the problem of finding statistically significant associations between cis-regulatory motifs and functional gene sets, in order to understand the biological roles of transcription factors. We develop a computational framework for this task, whose features include a new statistical score for motif scanning, the use of different scores for predicting targets of different motifs, and new ways to deal with redundancies among significant motif-function associations. This framework is applied to the recently sequenced genome of the jewel wasp, Nasonia vitripennis, making use of the existing knowledge of motifs and gene annotations in another insect genome, that of the fruitfly. The framework uses cross-species comparison to improve the specificity of its predictions, and does so without relying upon non-coding sequence alignment. It is therefore well suited for comparative genomics across large evolutionary divergences, where existing alignment-based methods are not applicable. We also apply the framework to find motifs associated with socially regulated gene sets in the honeybee, Apis mellifera, using comparisons with Nasonia, a solitary species, to identify honeybee-specific associations.

Figure 4 (pcbi-1000652-g004): Performance of predicted motif – GO associations using cross-species comparison, evaluated based on ChIP-based binding data. The prediction performance is shown as the precision of the top 5 and top 10 predictions per motif (“PrecAt5” and “PrecAt10”, respectively), the precision at a significance threshold (p-value) of 0.005 (“PrecAtPval0.005”), and as the point where precision equals recall (“PrecEqRecall”). Three different levels of significance (“cutoff E-value” 0.001 (A, D), 0.01 (B, E), and 0.05 (C, F)) were used to define the set of true associations, and the effect of cross-species comparison on the Nasonia (A–C) and Drosophila (D–F) motif function maps was reported separately.

Mentions: To assess the advantage of this strategy, we constructed a benchmark of highly reliable motif – function associations that were based on chromatin immunoprecipitation (ChIP)-based TF occupancy data. We roughly followed the methodology of Boden and Bailey [39], where a “gold standard” of TF – GO associations was constructed for yeast and human. We started by compiling 13 data sets of ChIP-based binding data in Drosophila, corresponding to 10 distinct TFs. We used the respective author-defined TF target gene sets, and compiled the GO terms enriched in these target sets at three different levels of significance (E-value 0.05, 0.01, 0.001, see Methods). These TF – GO associations were treated as the benchmark of “true” motif – function associations that our pipeline would try to predict, either in its single species version, or by exploiting cross-species comparison. To examine the effect of the species with which comparisons are made, we included the genomes of Apis mellifera, Tribolium castaneum, Anopheles gambiae, and Drosophila virilis, in addition to Drosophila melanogaster, as separate evolutionary filters for the Nasonia motif function map. The performance of these predictions is shown in Figure 4, as the precision of the top 5 and top 10 predictions per motif (“PrecAt5” and “PrecAt10”, respectively), the precision at a fixed significance threshold (p-value of 0.005) (“PrecAtPval0.005”), and as the point where precision equals recall (“PrecEqRecall”). Here, “precision” is the fraction of predicted associations that are “true” and recall is the fraction of “true” associations that were predicted as being significant. We note that the performance (by all measures) improves substantially in going from Nasonia (single species) to pairwise comparison-based predictions, the only (minor) exception being the “PrecAtPval0.005” measure for Nasonia – Tribolium comparisons (Figure 4A–C). 
The improvement is most pronounced for Nasonia – Drosophila comparisons (e.g., “PrecAt5” improves from 0.2 to 0.36), presumably due to the benchmark being from Drosophila. We also note that in these large divergence comparisons, the actual evolutionary distance from Nasonia (e.g., ∼180 Myrs for Apis and ∼300 Myrs for Anopheles) does not make a significant difference in performance, except for the “PrecAtPval0.005” measure that is substantially more improved with Apis comparisons than with Tribolium or Anopheles comparisons. The effect of cross-species comparison on the Drosophila motif function map (Figure 4D–F) shows a slightly different trend. The precision consistently improves in going from Drosophila melanogaster (single species) to Drosophila melanogaster – Drosophila virilis comparison-based predictions, although the recall drops (“PrecEqRecall” remains at 0.30 in either case). However, comparison with largely diverged species such as Anopheles, Tribolium, Nasonia and Apis suffers both in precision and recall, again with the exception of the “PrecAtPval0.005” measure which conveys a mixed message. Finally, we observe that single species predictions are substantially better in Drosophila than in Nasonia, which is expected since (a) the benchmark associations are derived from Drosophila and may not be biologically “true” in Nasonia, and (b) the pipeline's application to Nasonia uses motif and GO data from Drosophila. The above trends, and particularly the improvement in precision through the use of cross-species comparison, were also confirmed with a second benchmark that we constructed based on bona fide TF binding sites from the REDfly database [25]. (See Methods and Figure S3.)
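
The four performance measures described above can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes predictions are (motif, GO term, p-value) triples and the benchmark is a set of (motif, GO term) pairs; the function names, the pooling of the top-k predictions across motifs, and the toy data are all hypothetical. The break-even point is taken as the precision among the top N predictions overall, where N is the number of benchmark associations, which is one common way to find the point where precision equals recall.

```python
from collections import defaultdict


def precision(selected, truth):
    """Fraction of selected (motif, GO) pairs that are in the benchmark."""
    if not selected:
        return 0.0
    return sum(1 for pair in selected if pair in truth) / len(selected)


def top_k_per_motif(preds, k):
    """Keep the k most significant predictions of each motif (pooled)."""
    by_motif = defaultdict(list)
    for motif, go, pval in preds:
        by_motif[motif].append((pval, go))
    selected = []
    for motif, scored in by_motif.items():
        for pval, go in sorted(scored)[:k]:
            selected.append((motif, go))
    return selected


def prec_at_k(preds, truth, k):
    """Analogue of PrecAt5 / PrecAt10 (assumed pooling across motifs)."""
    return precision(top_k_per_motif(preds, k), truth)


def prec_at_pval(preds, truth, threshold=0.005):
    """Analogue of PrecAtPval0.005: precision at a fixed p-value cutoff."""
    selected = [(m, g) for m, g, p in preds if p <= threshold]
    return precision(selected, truth)


def prec_eq_recall(preds, truth):
    """Analogue of PrecEqRecall: precision among the top N predictions,
    N = number of benchmark pairs. With at least N predictions, precision
    and recall share numerator and denominator, so they are equal here."""
    ranked = sorted(preds, key=lambda t: t[2])
    top = [(m, g) for m, g, _ in ranked[:len(truth)]]
    return precision(top, truth)


# Made-up toy data, for illustration only.
toy_preds = [
    ("motifA", "term1", 0.001),
    ("motifA", "term2", 0.004),
    ("motifA", "term3", 0.02),
    ("motifB", "term1", 0.003),
    ("motifB", "term4", 0.01),
]
toy_truth = {("motifA", "term1"), ("motifA", "term3"), ("motifB", "term4")}

print(prec_at_k(toy_preds, toy_truth, 1))   # top prediction per motif
print(prec_at_pval(toy_preds, toy_truth))   # predictions with p <= 0.005
print(prec_eq_recall(toy_preds, toy_truth)) # break-even point
```

In this toy example only one of the two top-ranked predictions per motif is in the benchmark, so the top-1 precision is 0.5. The same functions could be applied once to a single-species prediction list and once to a cross-species filtered list to compare them as in Figure 4.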

