WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences.

Pavesi G, Zambelli F, Pesole G - BMC Bioinformatics (2007)

Bottom Line: This work addresses the problem of detecting conserved transcription factor binding sites and, more generally, regulatory regions through the analysis of sequences from homologous genes, an approach that is increasingly used given the ever-increasing amount of genomic data available. Unlike the most commonly used methods, the approach we present neither needs nor computes an alignment of the sequences investigated, nor does it resort to descriptors of the binding specificity of known transcription factors. The main novel idea we introduce is a relative measure of conservation, which assumes that true functional elements should be more conserved than the sequence surrounding them.


Affiliation: Dipartimento di Scienze Biomolecolari e Biotecnologie, University of Milan, Milan, Italy. giulio.pavesi@unimi.it

ABSTRACT

Background: This work addresses the problem of detecting conserved transcription factor binding sites and, more generally, regulatory regions through the analysis of sequences from homologous genes, an approach that is increasingly used given the ever-increasing amount of genomic data available.

Results: We present an algorithm that identifies conserved transcription factor binding sites in a given sequence by comparing it to one or more homologs, adapting a framework we previously introduced for the discovery of sites in sequences from co-regulated genes. Unlike the most commonly used methods, the approach we present neither needs nor computes an alignment of the sequences investigated, nor does it resort to descriptors of the binding specificity of known transcription factors. The main novel idea we introduce is a relative measure of conservation, which assumes that true functional elements should be more conserved than the sequence surrounding them. We present tests in which we applied the algorithm to the identification of conserved annotated sites in homologous promoters, as well as in distal regions such as enhancers.

Conclusion: The test results show that the algorithm can provide fast and reliable predictions of conserved transcription factor binding sites regulating the transcription of a gene, with better performance than other available methods for the same task. We also show examples of how the algorithm can be successfully employed when promoter annotations of the genes investigated are missing, or when regulatory sites and regions are located far away from the genes.
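
The idea of a relative, alignment-free measure of conservation can be illustrated with a toy sketch. This is not the WeederH scoring scheme, which is not detailed in this abstract: the function names, the Hamming-distance matching, the window length `k` and the z-score normalisation are all assumptions made for illustration. Each window of the reference is scored by its best ungapped match anywhere in the homolog, and that score is then expressed relative to the other windows of the same sequence rather than against a global threshold.

```python
from statistics import mean, pstdev


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match_identity(kmer: str, homolog: str) -> int:
    """Identities of the best ungapped match of `kmer` anywhere in `homolog`."""
    k = len(kmer)
    if len(homolog) < k:
        raise ValueError("homolog is shorter than the window length")
    fewest_mismatches = min(
        hamming(kmer, homolog[j:j + k]) for j in range(len(homolog) - k + 1)
    )
    return k - fewest_mismatches


def relative_conservation(reference: str, homolog: str, k: int = 8) -> list[float]:
    """Z-score of each window's best-match identity (hypothetical measure).

    A window scores high only if it is more conserved than the other
    windows of the same reference sequence.
    """
    if len(reference) < k:
        raise ValueError("reference is shorter than the window length")
    raw = [
        best_match_identity(reference[i:i + k], homolog)
        for i in range(len(reference) - k + 1)
    ]
    mu, sigma = mean(raw), pstdev(raw)
    if sigma == 0:
        # Every window is equally conserved: nothing stands out.
        return [0.0] * len(raw)
    return [(r - mu) / sigma for r in raw]


# Toy data: one shared 8-mer embedded in otherwise unrelated flanks.
reference = "AAAAAAAA" + "TGACGTCA" + "AAAAAAAA"
homolog = "CCCCCCCC" + "TGACGTCA" + "CCCCCCCC"
scores = relative_conservation(reference, homolog, k=8)
peak = max(range(len(scores)), key=scores.__getitem__)
```

In this toy input the highest relative score falls on the window that starts at the shared 8-mer, regardless of how divergent or similar the flanking sequence is overall.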


Figure 5: WeederH output on the promoter and 5'UTR of the MYB gene (as defined in the ABS database). Highest scoring motif output by WeederH (top track), ABS annotated site (in green), sequence of the ABS database (S66422 – aligned with BLAT), and conserved regions predicted by RP score ([27], light blue) and phastCons ([45], bottom track). Note that both of the latter methods predict long conserved regions spanning most of the promoter itself, which makes it difficult to identify single conserved TFBSs.

Mentions: We also compared our results to the phastCons annotations [28,45] available at the UCSC genome browser [13]. Overall, around 25% of the reference sequences used in this test is annotated as "most conserved", and about 55% of the sites annotated in the database are covered (50% at the nucleotide level). Although the tracks are generated by comparing human to all the available vertebrate genomic sequences, rather than to rodents alone (and vice versa), this result nevertheless highlights that methods like phastCons are better suited to identifying large conserved regions than single sites. Examples are shown in Figures 4 and 5, where the results of WeederH are compared to the conserved regions predicted by phastCons and RP [27], both available at the UCSC genome browser. These examples show typical situations in which the two methods either fail to detect any conserved element in the core promoter, because single TFBSs are too small to reach a significant level of conservation (Figure 4), or can single out only large conserved regions (Figure 5; see also the p53 example in Figure 1). This drawback also derives from the use of a single, global significance threshold. The examples show that WeederH can work at a much finer level of detail, correctly identifying conserved TFBSs whether the overall level of sequence conservation is low or, conversely, very high.
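
The distinction between site-level and nucleotide-level coverage quoted above can be made concrete with a short sketch. The coordinates below are invented, the function names are hypothetical, and counting a site as covered when it overlaps a conserved region by at least one base is an assumption, since the criterion used in the paper is not stated in this excerpt.

```python
Interval = tuple[int, int]  # half-open [start, end) coordinates


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def coverage(sites: list[Interval], regions: list[Interval]) -> tuple[float, float]:
    """Return (fraction of sites hit, fraction of site nucleotides covered).

    Assumes `regions` do not overlap one another, so that per-region
    overlaps can simply be summed.
    """
    sites_hit = 0
    covered_nt = 0
    total_nt = 0
    for site in sites:
        shared = sum(overlap(site, region) for region in regions)
        covered_nt += shared
        total_nt += site[1] - site[0]
        if shared > 0:  # assumption: any overlap counts as "covered"
            sites_hit += 1
    return sites_hit / len(sites), covered_nt / total_nt


# Invented example: four 10-bp annotated sites, two predicted conserved regions.
annotated_sites = [(10, 20), (30, 40), (50, 60), (70, 80)]
conserved_regions = [(15, 35), (55, 58)]
site_fraction, nucleotide_fraction = coverage(annotated_sites, conserved_regions)
```

Here three of the four sites are touched by a conserved region (0.75 at the site level), but only 13 of their 40 bases are actually covered (0.325 at the nucleotide level), which is why the two figures can differ.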

