MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes.

Marinescu VD, Kohane IS, Riva A - BMC Bioinformatics (2005)

Bottom Line: In several test cases, the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method. Due to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.

Affiliation: Children's Hospital Boston, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA. vdmarinescu@chip.org

ABSTRACT

Background: Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity.

Results: We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several test cases, the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method.

Conclusion: The search engine, available at http://mapper.chip.org, allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition it allows the user to upload a sequence to query and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Due to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.

Figure 4: Different representations of the set of putative TFBSs in the human MCM5 gene promoter. A. Graphical representation of the hit set presented in Figure 3. B. The hit set was exported to the UCSC Human Genome Browser as a custom track. The region displayed in this image extends to 500 bp upstream of the coding sequence start. Note that the clusters of predicted binding sites correspond to peaks in the human/mouse conservation track at the bottom, suggesting that those regions are functional. The positions of the most conserved elements displayed in the conservation track are the ones used in the previous page to highlight hits in evolutionarily conserved regions (see Methods for details).

Mentions: The user can choose to display all hits for a given sequence, or only the hits for factors that are common across the orthologs of that sequence (if present in the HomoloGene database). The output of the system is the list of putative hits found under the specified conditions; the default score and E-value thresholds are 0 and 10, respectively (Figure 3). For each hit the system displays the model used to retrieve it, its location, score and E-value and (in a pop-up window) the alignment between the model and the sequence at that site. The hit set can be sorted by position (from the ATG or the start of the transcript for genes supplied via an identifier, or from the beginning of the sequence for FASTA sequences), by model name, model accession, score or E-value. In addition, the results page highlights adjacent sites (situated within 50 bp of each other) retrieved for TFs that are known to physically interact with each other (as annotated in the TRANSFAC database), and also the classes of TFs for which putative sites were found. For each TRANSFAC factor-derived or JASPAR-derived model, the organism in which the factor was described is specified. The user can choose to highlight hits in evolutionarily conserved regions representing the most conserved elements between sets of organisms provided by the UCSC Genome Browser annotations (see Methods for details). Figure 4A shows a graphical representation of the position and orientation of the hits listed in Figure 3. Arrows are drawn to scale and in some cases represent the sum of overlapping sites for the different transcription factors that are listed above or beneath them. Hits occurring in evolutionarily conserved regions are displayed in red.
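The thresholding, sorting and adjacency highlighting described above can be illustrated with a minimal Python sketch. It is not MAPPER's code: the `Hit` record, the function names, the toy factor labels (`TF_A`, `TF_B`, `TF_C`) and the interaction set are all hypothetical, and measuring "within 50 bp" as the gap between the end of one site and the start of the next is an assumption, since the text does not say how the distance is computed.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hit:
    """One putative binding site (hypothetical record layout)."""
    model: str     # name of the model that retrieved the hit
    factor: str    # transcription factor the model represents
    start: int     # position relative to the chosen reference point
    end: int
    score: float
    evalue: float


def filter_hits(hits, min_score=0.0, max_evalue=10.0):
    """Keep hits passing both thresholds (defaults follow the text: 0 and 10)."""
    return [h for h in hits if h.score >= min_score and h.evalue <= max_evalue]


def sort_hits(hits, key="position"):
    """Sort by position, model name, score (best first) or E-value."""
    keys = {
        "position": lambda h: h.start,
        "model": lambda h: h.model,
        "score": lambda h: -h.score,
        "evalue": lambda h: h.evalue,
    }
    return sorted(hits, key=keys[key])


def gap(a, b):
    """Distance in bp between two sites; 0 if they overlap (assumed definition)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def adjacent_interacting(hits, interactions, max_gap=50):
    """Pairs of hits within max_gap bp whose factors are annotated as interacting."""
    ordered = sorted(hits, key=lambda h: h.start)
    return [
        (a, b)
        for a, b in combinations(ordered, 2)
        if frozenset((a.factor, b.factor)) in interactions and gap(a, b) <= max_gap
    ]


# Toy data, invented for illustration only.
hits = [
    Hit("M1", "TF_A", -120, -110, 5.2, 0.3),
    Hit("M2", "TF_B", -90, -80, 3.1, 2.0),
    Hit("M3", "TF_C", -400, -390, -1.0, 4.0),   # fails the score threshold
    Hit("M4", "TF_B", -300, -290, 2.0, 50.0),   # fails the E-value threshold
]
interactions = {frozenset(("TF_A", "TF_B"))}

kept = sort_hits(filter_hits(hits), key="position")
pairs = adjacent_interacting(kept, interactions)
print([h.model for h in kept])
print([(a.model, b.model) for a, b in pairs])
```

Storing each interaction as a `frozenset` keeps the lookup independent of the order in which the two factors are named.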
While the results page in Figure 3 lists and sorts all putative TFBSs independently of each other, the graphical representation in Figure 4A makes it easy to identify regions in which more than one model (corresponding to the same factor or to closely related factors) detects a binding site, which makes the prediction more likely to represent a true binding site. The set of hits can also be exported and displayed in the UCSC Genome Browser using its "custom tracks" feature (Figure 4B). This allows the user to view the TFBSs in the general context of the genomic region in which they appear and to take advantage of the powerful visualization tools of the UCSC Genome Browser in order to highlight important features of the genomic region.
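As an illustration of what a custom-track export involves, the sketch below writes a hit set as a UCSC custom track in BED format (a `track` header line followed by tab-separated chrom, chromStart, chromEnd, name, score and strand fields). It is a sketch under stated assumptions, not MAPPER's exporter: the function name and tuple layout are hypothetical, the input coordinates are assumed to be 1-based inclusive genomic positions, the hit score is simply rounded and clamped to BED's 0 to 1000 range, and the coordinates in the usage lines are placeholders rather than real MCM5 positions.

```python
def hits_to_bed_track(hits, track_name="putative_TFBS",
                      description="Putative TFBS hits"):
    """Format hits as a UCSC custom track in BED6 format.

    Each hit is a tuple (chrom, start, end, name, score, strand) with
    1-based inclusive coordinates (an assumption of this sketch).
    """
    lines = [f'track name="{track_name}" description="{description}" visibility=2']
    for chrom, start, end, name, score, strand in hits:
        # BED uses 0-based, half-open intervals: shift the start, keep the end.
        # BED scores must be integers in 0..1000; clamping is an arbitrary choice here.
        bed_score = max(0, min(1000, round(score)))
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{name}\t{bed_score}\t{strand}")
    return "\n".join(lines) + "\n"


# Placeholder coordinates, invented for illustration only.
example_hits = [
    ("chr1", 1001, 1010, "TF_A", 5.2, "+"),
    ("chr1", 1031, 1040, "TF_B", 3.1, "-"),
]
track_text = hits_to_bed_track(example_hits)
print(track_text)
```

The resulting text can be pasted or uploaded through the browser's custom track form; the strand column is what lets the browser draw each site with its orientation.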

