Predicting site-specific human selective pressure using evolutionary signatures.

Sadri J, Diallo AB, Blanchette M - Bioinformatics (2011)

Bottom Line: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches.

Affiliation: School of Computer Science, McGill University, 3630 University, Montreal, QC, Canada H3A 2B2.

ABSTRACT

Motivation: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available.

Results: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in human based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches.

Availability: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri.

Contact: blanchem@mcb.mcgill.ca.

Figure 3: Performance of various previously published measures of sequence conservation (PhastCons, PhyloP-SCORE, GERP, PhyloP-LRT), compared with predictors developed in this article. X-axis: fraction of test examples predicted as positive; Y-axis: positive predictive value (fraction of human-conservation predictions that are indeed human conserved).

All three types of predictors produced poor results when large values of w were used. The Naive Bayes and KNN classifiers performed best on Feature Set 2 (with w=0), whereas the SVM-based approaches were unable to handle the large number of features this set contains. On the other hand, the SVM approach was very effective at using the smaller number of features from Feature Set 1 and produced good results for both w=0 and w=1. Figure 3 shows the positive predictive values (PPV, defined as the ratio of the number of true positive predictions to the number of positive predictions) obtained for each of these classifiers. Because we expect the fraction of functional sites in our balanced training and testing sets to be relatively small (probably around 5 to 10%), we only plot PPVs for prediction thresholds resulting in up to 20% of the test examples being predicted positive. The two SVM predictors (Feature Set 1, w=0 or 1) clearly outperform all other approaches over much of the range of prediction thresholds. The Naive Bayes and KNN predictors perform relatively poorly for high-confidence predictions, although they become competitive with the two SVM predictors at lower-confidence calls.
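The PPV-versus-fraction curve described for Figure 3 can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `ppv_at_fraction` and `ppv_curve`, the tie-breaking rule, and the toy scores and labels are all assumptions made for the example. It ranks test examples by predictor score, calls the top fraction positive, and reports the share of those calls that are true positives.

```python
def ppv_at_fraction(scores, labels, fraction):
    """PPV when the top `fraction` of examples (by score) are called positive.

    scores: one predictor score per example, higher = more confident.
    labels: 1 for a truly human-conserved example, 0 otherwise.
    Ties are broken by input order (an arbitrary choice for this sketch).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Number of examples predicted positive at this threshold (at least one).
    n_called = max(1, int(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(1 for _, label in ranked[:n_called] if label == 1)
    return true_pos / n_called


def ppv_curve(scores, labels, fractions):
    """Return (fraction, PPV) pairs, one per requested fraction."""
    return [(f, ppv_at_fraction(scores, labels, f)) for f in fractions]


# Toy data, invented purely for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Mirror the plotted range: only thresholds calling up to 20% positive.
for frac, ppv in ppv_curve(toy_scores, toy_labels, [0.1, 0.2]):
    print(f"fraction predicted positive = {frac:.2f}, PPV = {ppv:.2f}")
```

Restricting the fractions to 0.2 or less matches the plotted range, where only the high-confidence calls matter because few sites are expected to be functional.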

