Exploiting ancestral mammalian genomes for the prediction of human transcription factor binding sites.

Blanchette M - BMC Bioinformatics (2012)

Bottom Line: We propose a TFBS prediction approach that makes use of the availability of inferred ancestral mammalian genomes to improve its accuracy. After proposing a neutral evolutionary model of predicted TFBS counts in a DNA region of a given length, we use it to identify regions that have preserved the number of predicted TFBS they contain to an unexpected degree given their divergence. The approach is applied to human chromosome 1 and shows significant gains in accuracy as compared to both existing single-species and multi-species TFBS prediction approaches, in particular for transcription factors that are subject to high turnover rates.

Affiliation: McGill Centre for Bioinformatics and School of Computer Science, McGill University, H3C 2B4 Québec, Canada. blanchem@cs.mcgill.ca

ABSTRACT

Background: The computational prediction of Transcription Factor Binding Sites (TFBS) remains a challenge due to their short length and low information content. Comparative genomics approaches that simultaneously consider several related species and favor sites that have been conserved throughout evolution improve the accuracy (specificity) of the predictions but are limited due to a phenomenon called binding site turnover, where sequence evolution causes one TFBS to replace another in the same region. In parallel to this development, an increasing number of mammalian genomes are now sequenced and it is becoming possible to infer, to a surprisingly high degree of accuracy, ancestral mammalian sequences.

Results: We propose a TFBS prediction approach that makes use of the availability of inferred ancestral mammalian genomes to improve its accuracy. This method aims to identify binding loci, which are regions of a few hundred base pairs that have preserved their potential to bind a given transcription factor over evolutionary time. After proposing a neutral evolutionary model of predicted TFBS counts in a DNA region of a given length, we use it to identify regions that have preserved the number of predicted TFBS they contain to an unexpected degree given their divergence. The approach is applied to human chromosome 1 and shows significant gains in accuracy as compared to both existing single-species and multi-species TFBS prediction approaches, in particular for transcription factors that are subject to high turnover rates.

Availability: The source code and predictions made by the program are available at http://www.cs.mcgill.ca/~blanchem/bindingLoci.

Figure 1: Examples of the distribution of the number of predicted TFBS in the descendant of a 200-bp ancestral sequence containing two predicted sites that diverge for various durations, for a TF with a relatively high information content matrix (SRF; left) and one with a relatively low information content (GATA-1; right). At low divergence values, the descendant sequence still contains two sites with high probability. As the level of divergence increases, the distribution of the number of predicted sites converges toward a different stationary distribution for each TF.
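The behaviour described in the Figure 1 caption can be illustrated with a small Monte Carlo simulation. The sketch below is not the author's model: it assumes a Jukes-Cantor substitution process, and it replaces the SRF and GATA-1 matrices with a hypothetical exact-match toy motif (`MOTIF`). All function names are illustrative.

```python
import math
import random
from collections import Counter

MOTIF = "GATA"  # toy stand-in for a TF motif; NOT a real position weight matrix


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_sites(seq, motif=MOTIF):
    """Count exact motif matches on both strands (overlaps allowed)."""
    rc = revcomp(motif)
    k = len(motif)
    n = 0
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        n += (w == motif) + (w == rc and rc != motif)
    return n


def evolve_jc(seq, d, rng):
    """Evolve seq for d expected substitutions per site under Jukes-Cantor.

    Each base differs from its ancestor with probability
    3/4 * (1 - exp(-4d/3)), changing to one of the other three bases.
    """
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    out = []
    for b in seq:
        if rng.random() < p:
            out.append(rng.choice([x for x in "ACGT" if x != b]))
        else:
            out.append(b)
    return "".join(out)


def random_ancestor(length, n_sites, rng):
    """Rejection-sample a uniform random sequence with exactly n_sites matches."""
    while True:
        s = "".join(rng.choice("ACGT") for _ in range(length))
        if count_sites(s) == n_sites:
            return s


def descendant_count_distribution(n_sites=2, length=200, d=0.1,
                                  trials=2000, seed=0):
    """Estimate P(descendant has k predicted sites | ancestor has n_sites)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        anc = random_ancestor(length, n_sites, rng)
        counts[count_sites(evolve_jc(anc, d, rng))] += 1
    return {k: v / trials for k, v in sorted(counts.items())}
```

Calling `descendant_count_distribution(d=0.01)` and `descendant_count_distribution(d=1.0)` should show the mass concentrated on two sites at low divergence and spread toward the motif's background distribution at high divergence, which is the qualitative pattern the caption describes.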

Mentions: which measures the probability that a random sequence containing B(Gp(u)(W), T) predicted binding sites and evolving neutrally for δ = λ(Gp(u)(W), Gu(W)) time will have preserved its number of binding sites to at least the same degree as has been observed in the actual pair of sequences. Tree branches with small p-values correspond to branches where the observed variation in TFBS counts is lower than expected for neutrally evolving sequences, suggesting that natural selection may be at work preserving the predicted sites. However, as shown in Figure 1, p-values obtained for individual tree branches are rarely sufficiently small to reliably suggest deviations from the neutral evolution model. Instead, we combine the p-values obtained over the branches of the whole tree to obtain a global binding locus score for the region W and TF T under consideration:
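The combining formula itself is not reproduced in this excerpt. As a rough illustration of the two steps described above, the sketch below computes a per-branch p-value from a neutral count distribution and then combines branch p-values with Fisher's method. Both choices are assumptions: the per-branch definition (probability that the count changes by no more than the observed amount) is one reading of "preserved to at least the same degree", and Fisher's method is a standard stand-in, not necessarily the score used in the paper. The function names are hypothetical.

```python
import math


def branch_p_value(neutral_dist, ancestral_count, observed_count):
    """P(|N - ancestral| <= |observed - ancestral|) under the neutral model.

    neutral_dist maps a descendant site count N to its probability, given
    the ancestral count and the branch's divergence (assumed precomputed).
    """
    obs_change = abs(observed_count - ancestral_count)
    return sum(p for n, p in neutral_dist.items()
               if abs(n - ancestral_count) <= obs_change)


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (a stand-in here).

    X = -2 * sum(ln p_i) follows a chi-square law with 2k degrees of freedom
    under the null; for even degrees of freedom the survival function is
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    # Clamp to avoid log(0) if a branch p-value is exactly zero.
    half_x = -sum(math.log(max(p, 1e-300)) for p in p_values)
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= half_x / (i + 1)
    return math.exp(-half_x) * total
```

With this reading, several branches that each show only mild count preservation can still yield a small combined value, which matches the motivation given above for scoring the whole tree rather than individual branches. Fisher's method assumes the branch p-values are independent, which may not hold exactly on a phylogeny.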

