Predicting site-specific human selective pressure using evolutionary signatures.

Sadri J, Diallo AB, Blanchette M - Bioinformatics (2011)

Bottom Line: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches.


Affiliation: School of Computer Science, McGill University, 3630 University, Montreal, QC, Canada H3A 2B2.

ABSTRACT

Motivation: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available.

Results: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in human based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches.

Availability: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri.

Contact: blanchem@mcb.mcgill.ca.


Figure 2: Individual feature informativeness. (a) Log-likelihood ratio of human conservation in the presence or absence of orthologous bases in the given species and at the given offset from the considered position. (b) Log-likelihood ratio of human conservation in the presence of a conservation versus a substitution along the given branch and at the given offset. For both (a) and (b), ratios for non-mammalian species are too noisy and are not shown. (c) Mutual information between the presence of human conservation and the event along each branch at each offset.

Mentions: We first measured how informative individual events along each branch of the tree are. This information can be measured in several ways. First, we consider whether the presence of orthologous bases in a given species (extant or ancestral) affects the likelihood of a conservation event along the human branch. A human site i may have no detectable ortholog in a given species s for several reasons: (i) site i was inserted after the last common ancestor of s and human [denoted LCA(s,human)]; (ii) site i was deleted since LCA(s,human) along the lineage leading to s; (iii) site i actually has an ortholog in s, but it and the surrounding sequence have diverged to the point where orthology cannot be detected (or, in the case of ancestral sequences, none of its descendants has a detectable ortholog). Figure 2a plots the likelihood ratio of human conservation at site i in the presence versus the absence of an orthologous base on branch b at site i+δ, that is, Pr[conservation at i | ortholog present on b at i+δ] / Pr[conservation at i | ortholog absent on b at i+δ]. As expected, detectable orthology is relatively uninformative for primate species, as the vast majority of both functional and non-functional human sites have orthologs in these species. However, the value of orthology increases as we consider more divergent species, especially fast-evolving ones such as rodents. This is because, for highly diverged species, an increasing fraction of non-functional regions is either deleted or mutated beyond recognition, thus concentrating human functional sites in the fraction of sites with detectable orthologs. For example, sites with orthologs in other primate species are only ~7% more likely to be conserved than those without primate orthologs, but this number increases to 16% for other eutherians, 23% for marsupials and 33% for birds and reptiles. The trend presumably continues for more distant species such as fish, but we have insufficient data to observe it.
It is interesting to note that events occurring at neighboring sites i+δ are also quite informative about the fate of site i, even for large δ. The presence of orthologous bases located even 250 bp away from the current site is only marginally less informative than orthology at the site itself. This is because functional regions and detectable orthology blocks are generally quite large.
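The two quantities summarized in Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-site boolean encoding, the event labels, and the choice of natural logarithm for the ratio and bits for the mutual information are all assumptions made for illustration. The first function estimates the log-likelihood ratio of human conservation at site i given the presence versus the absence of an ortholog at site i+δ; the second estimates the mutual information between human conservation at site i and the event observed along one branch at site i+δ.

```python
import math
from collections import Counter
from typing import Hashable, Optional, Sequence


def log_likelihood_ratio(conserved: Sequence[int],
                         has_ortholog: Sequence[int],
                         delta: int = 0) -> Optional[float]:
    """Estimate log(P(conserved_i | ortholog at i+delta) /
                    P(conserved_i | no ortholog at i+delta)).

    Hypothetical encoding: conserved[i] is truthy if human site i is
    conserved; has_ortholog[j] is truthy if site j has a detectable
    ortholog in the species (or ancestor) of interest.
    Returns None when a probability cannot be estimated or is zero.
    """
    n = len(conserved)
    # ortholog present? -> [number of sites, number of conserved sites]
    counts = {True: [0, 0], False: [0, 0]}
    for i in range(n):
        j = i + delta
        if not 0 <= j < n:
            continue  # the offset falls outside the region
        bucket = counts[bool(has_ortholog[j])]
        bucket[0] += 1
        bucket[1] += int(bool(conserved[i]))
    (n_present, c_present), (n_absent, c_absent) = counts[True], counts[False]
    if 0 in (n_present, n_absent, c_present, c_absent):
        return None
    return math.log((c_present / n_present) / (c_absent / n_absent))


def mutual_information(conserved: Sequence[int],
                       events: Sequence[Hashable],
                       delta: int = 0) -> float:
    """Estimate, in bits, the mutual information between human
    conservation at site i and the event label at site i+delta.

    events[j] is any hashable label (the labels used in the example
    below are hypothetical, not taken from the paper).
    """
    n = len(conserved)
    joint = Counter()
    for i in range(n):
        j = i + delta
        if 0 <= j < n:
            joint[(bool(conserved[i]), events[j])] += 1
    total = sum(joint.values())
    if total == 0:
        return 0.0
    p_cons = Counter()
    p_event = Counter()
    for (c, e), k in joint.items():
        p_cons[c] += k / total
        p_event[e] += k / total
    return sum((k / total) * math.log2((k / total) / (p_cons[c] * p_event[e]))
               for (c, e), k in joint.items())


# Toy data, invented purely to exercise the functions.
toy_conserved = [1, 1, 1, 0, 1, 0, 0, 0]
toy_ortholog = [1, 1, 1, 1, 0, 0, 0, 0]
llr_same_site = log_likelihood_ratio(toy_conserved, toy_ortholog, delta=0)
llr_neighbor = log_likelihood_ratio(toy_conserved, toy_ortholog, delta=1)
mi_dependent = mutual_information([1, 1, 0, 0], ["cons", "cons", "subst", "subst"])
mi_independent = mutual_information([1, 0, 1, 0], ["cons", "cons", "subst", "subst"])
```

In this toy data, 3 of the 4 sites with an ortholog are conserved against 1 of the 4 without, so the same-site ratio is 3 before taking the logarithm. Sweeping `delta` over a range of offsets and repeating the calculation per species would give one curve per species, in the spirit of Figure 2a.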

