Insight into neutral and disease-associated human genetic variants through interpretable predictors.

van den Berg BA, Reinders MJ, de Ridder D, de Beer TA - PLoS ONE (2015)

Bottom Line: The high performance of current sequence-based predictors indicates that sequence data contains valuable information about whether a variant is neutral or disease-associated. However, most predictors do not readily disclose this information, so it remains unclear which sequence properties are most important. Furthermore, for large sets of known variants, interpreting predictors can provide insight into the mechanisms responsible for variants being disease-associated.

Affiliation: Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands; Netherlands Bioinformatics Centre, Nijmegen, The Netherlands; Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands.

ABSTRACT
A variety of methods that predict human nonsynonymous single nucleotide polymorphisms (SNPs) to be neutral or disease-associated have been developed over the last decade. These methods are used for pinpointing disease-associated variants in the many variants obtained with next-generation sequencing technologies. The high performances of current sequence-based predictors indicate that sequence data contains valuable information about a variant being neutral or disease-associated. However, most predictors do not readily disclose this information, and so it remains unclear what sequence properties are most important. Here, we show how we can obtain insight into sequence characteristics of variants and their surroundings by interpreting predictors. We used an extensive range of features derived from the variant itself, its surrounding sequence, sequence conservation, and sequence annotation, and employed linear support vector machine classifiers to enable extracting feature importance from trained predictors. Our approach is useful for providing additional information about what features are most important for the predictions made. Furthermore, for large sets of known variants, it can provide insight into the mechanisms responsible for variants being disease-associated.
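
As a minimal sketch of how feature importance can be read from a trained linear classifier, the example below fits a small linear SVM by subgradient descent on an L2-regularised hinge loss and ranks features by the absolute value of their weights. The toy data, the feature names (`conservation`, `hydrophobicity_change`, `noise`), and the training settings are illustrative assumptions; they are not the authors' data, feature set, or implementation.

```python
# Illustrative sketch only: toy data and hypothetical feature names,
# not the feature set or training procedure used in the paper.

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Fit w, b by subgradient descent on the L2-regularised hinge loss.

    X is a list of feature vectors, y a list of labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Shrink the weights (gradient of the L2 penalty).
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss subgradient step for points inside the margin.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def rank_features(names, w):
    """Sort (name, weight) pairs by absolute weight, largest first."""
    return sorted(zip(names, w), key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical feature names, chosen for illustration.
FEATURE_NAMES = ["conservation", "hydrophobicity_change", "noise"]

# Toy variants: +1 = disease-associated, -1 = neutral.
# Only the first feature separates the two classes strongly.
X = [
    [2.0, 0.4, 0.1],
    [1.5, 0.1, -0.2],
    [1.8, 0.3, 0.3],
    [2.2, -0.1, -0.1],
    [-2.0, -0.3, 0.2],
    [-1.7, 0.1, -0.3],
    [-2.1, -0.4, 0.1],
    [-1.6, -0.2, -0.1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

weights, bias = train_linear_svm(X, y)
ranking = rank_features(FEATURE_NAMES, weights)

for name, weight in ranking:
    print(f"{name}: {weight:+.3f}")
```

Because the decision function is a weighted sum of the features, the sign of a weight shows which class a feature pushes towards and its magnitude shows how strongly, provided the features are on comparable scales. A non-linear kernel does not expose per-feature weights in this direct way.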

pone.0120729.g006: ROC curves showing classifier performance using all features. In blue, the performance of linear support vector machines using the combined subset classifier approach (CS) and of a classifier trained on the entire set of variants (CE). In gray, the performance of a non-linear support vector machine (RBF kernel) trained on the entire set of variants.

Mentions: The resulting performances are given in Table 3 (more results can be found in S2 Table and S4 Fig.). In the case of the linear support vector machines, subset classifiers (CS) consistently outperformed the classifiers trained on the entire set of variants (CE). The subset approach thus not only improves interpretation but also yields better classification performance (for linear classifiers). The best performance, 0.833 AUC, was obtained using the subset classifier trained on all features (Fig. 6).
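
For readers unfamiliar with the AUC metric used in this comparison, the sketch below computes the area under the ROC curve as the fraction of (positive, negative) pairs that a classifier ranks correctly, with ties counted as half. The labels and scores are made-up toy values for two hypothetical classifiers; they are not results from the paper.

```python
# Illustrative sketch only: toy labels and scores, not data from the paper.

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: higher means "more likely positive".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]  # hypothetical classifier A
scores_b = [0.9, 0.6, 0.2, 0.7, 0.3, 0.1]  # hypothetical classifier B

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so two classifiers evaluated on the same variants can be compared directly by this single number.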

