MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes.

Marinescu VD, Kohane IS, Riva A - BMC Bioinformatics (2005)

Bottom Line: In several cases tested the method identified correctly experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method. Due to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.


Affiliation: Children's Hospital Boston, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA. vdmarinescu@chip.org

ABSTRACT

Background: Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity.

Results: We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several cases tested the method identified correctly experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method.
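The idea of building a model from aligned known sites and scanning a sequence with it can be illustrated with a minimal sketch. This is not the MAPPER implementation: it assumes gap-free alignments and keeps only per-column match emissions, which reduces the model to a log-odds weight matrix, whereas a full profile HMM would also include insert and delete states. The function names, the toy sites, the uniform background, the pseudocount and the score threshold are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"

def build_match_model(aligned_sites, pseudocount=1.0):
    """Estimate per-column emission probabilities from gap-free aligned sites.

    Assumes every site contains only A, C, G or T and has the same length.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    model = []
    for col in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in aligned_sites:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        model.append({base: counts[base] / total for base in BASES})
    return model

def log_odds(model, window, background=0.25):
    """Log2 odds of the window under the model versus a uniform background."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(model, window))

def scan(model, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    sequence = sequence.upper()
    width = len(model)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = log_odds(model, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical toy alignment, not taken from TRANSFAC or JASPAR.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TTTAAA"]
model = build_match_model(toy_sites)
hits = scan(model, "GGGTATAAAGGG", threshold=5.0)
```

With these toy inputs the only window above the threshold is the one starting at offset 3. In practice the threshold would be chosen per model, since it controls the trade-off between sensitivity and specificity.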

Conclusion: The search engine, available at http://mapper.chip.org, allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition it allows the user to upload a sequence to query and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Due to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.


Figure 1: Quality measures for the alignments retrieved. A. Distribution of the parameters characterizing the model (length, number of sequences and the size of the nucleotide matrix used to train the model). B. Distribution of the median and average quality of the nucleotide sequences used to build the alignments for the TRANSFAC factor-derived models. The quality variable is categorical and represents "1 – functionally confirmed factor binding site; 2 – binding of pure protein purified or recombinant, 3 – immunologically characterized binding activity of a cellular extract, 4 – binding activity characterized via a known binding sequence, 5 – binding of uncharacterized extract protein to a bona fide element, 6 – no quality assigned" (cf. TRANSFAC documentation).

Mentions: HMMs were built using multiple sequence alignments of binding sites compiled from the TRANSFAC [20] and JASPAR [21] databases. TRANSFAC provides two sources of information regarding the binding sites for TFs: nucleotide sequences of binding sites referenced in the description of the TRANSFAC matrices that were optimally aligned and used to derive NWMs (designated below as matrix-derived alignments and catalogued with accession numbers starting with "M"), and nucleotide sequences of binding sites referenced as part of the description of the TFs – also referred to as "factors", used to extract alignments designated below as factor-derived alignments and catalogued with accession numbers starting with "T". By parsing the TRANSFAC flat files (see Methods for details) we obtained 402 alignments corresponding to matrices and 588 alignments corresponding to factor entries. In addition, 89 alignments were obtained from data downloaded from the JASPAR database. Thus, the total number of alignments used to build HMMs was equal to 1,079. Figure 1A shows the distribution of the length of the models and of the number of sequences and size of the nucleotide matrix used to train them. The models have an average length of 10 nucleotides and were trained on an average of 22 sequences. TRANSFAC assigns a quality value to the sites used to build the factor-derived models, based on the existing biological evidence of the binding (see Figure 1B legend for details). The distribution of the median and average quality of the sites used for the factor-derived models suggests that the large majority contains high quality sites (categorical values smaller than 4).
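The counts quoted above and the quality criterion can be checked with a short sketch. The three alignment counts come from the text; the dictionary labels, the `summarize_quality` helper, the choice to apply the "smaller than 4" cutoff to the median, and the example quality values are assumptions made for illustration only.

```python
from statistics import mean, median

# Alignment counts as reported in the text.
alignment_counts = {
    "TRANSFAC matrix-derived (M)": 402,
    "TRANSFAC factor-derived (T)": 588,
    "JASPAR": 89,
}
total_models = sum(alignment_counts.values())  # 402 + 588 + 89

# TRANSFAC quality categories run from 1 (best evidence) to 6 (none assigned);
# the text treats categorical values smaller than 4 as high quality.
HIGH_QUALITY_CUTOFF = 4

def summarize_quality(site_qualities):
    """Summarize the quality codes of the sites behind one factor-derived model.

    Applying the cutoff to the median is an assumption of this sketch.
    """
    med = median(site_qualities)
    avg = mean(site_qualities)
    return {"median": med, "mean": avg, "high_quality": med < HIGH_QUALITY_CUTOFF}

# Hypothetical quality codes for the sites of a single model.
example = summarize_quality([1, 2, 2, 3, 6])
```

The sum reproduces the 1,079 models stated in the text. For the hypothetical model, the single site with no quality assigned raises the mean but leaves the median within the high-quality range, which is one reason to report both statistics.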

