Insight into neutral and disease-associated human genetic variants through interpretable predictors.

van den Berg BA, Reinders MJ, de Ridder D, de Beer TA - PLoS ONE (2015)

Bottom Line: The high performance of current sequence-based predictors indicates that sequence data contains valuable information about whether a variant is neutral or disease-associated. However, most predictors do not readily disclose this information, so it remains unclear which sequence properties are most important. Furthermore, for large sets of known variants, interpreting predictors can provide insight into the mechanisms responsible for variants being disease-associated.


Affiliation: Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands; Netherlands Bioinformatics Centre, Nijmegen, The Netherlands; Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands.

ABSTRACT
A variety of methods that predict human nonsynonymous single nucleotide polymorphisms (SNPs) to be neutral or disease-associated have been developed over the last decade. These methods are used for pinpointing disease-associated variants in the many variants obtained with next-generation sequencing technologies. The high performances of current sequence-based predictors indicate that sequence data contains valuable information about a variant being neutral or disease-associated. However, most predictors do not readily disclose this information, and so it remains unclear what sequence properties are most important. Here, we show how we can obtain insight into sequence characteristics of variants and their surroundings by interpreting predictors. We used an extensive range of features derived from the variant itself, its surrounding sequence, sequence conservation, and sequence annotation, and employed linear support vector machine classifiers to enable extracting feature importance from trained predictors. Our approach is useful for providing additional information about what features are most important for the predictions made. Furthermore, for large sets of known variants, it can provide insight into the mechanisms responsible for variants being disease-associated.


Fig. 3 (pone.0120729.g003): Amino acid substitution feature weights. a) Heat map showing feature weights obtained from classifiers trained using the amino acid substitution features. The rows show feature weights obtained per variant subset classifier. The single row at the bottom shows feature weights obtained from a classifier trained on the entire set of variants. The rows and columns are ordered based on amino acid properties [23]. Low (blue) and high (red) weights indicate that the feature is predictive for neutral and disease-associated variants, respectively. Gray cells indicate amino acid substitutions that do not occur in the data set, because these substitutions require more than one mutation in the reference codon. b) Heat map showing log odds ratios between neutral and disease-associated variants that were obtained by counting the amino acid substitutions in our data set. Here, low (blue) and high (red) values indicate that substitutions occur relatively often in the set of neutral and disease-associated variants, respectively.
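The count-based log odds ratios in panel b can be illustrated with a small sketch. The caption does not give the exact formula, so the definition below (odds of a substitution among disease-associated variants divided by its odds among neutral variants, with a 0.5 pseudocount to avoid division by zero) is an assumption, as are the function name `log_odds_ratios` and the toy counts. The sign convention follows the caption: positive values mean a substitution is relatively more common among disease-associated variants.

```python
import math
from collections import Counter


def log_odds_ratios(variants, pseudocount=0.5):
    """Return {(ref_aa, mut_aa): log odds ratio} from labelled variants.

    variants: iterable of (ref_aa, mut_aa, label) with label "disease" or
    "neutral". The formula and the pseudocount are illustrative assumptions,
    not taken from the paper.
    """
    disease, neutral = Counter(), Counter()
    for ref, mut, label in variants:
        (disease if label == "disease" else neutral)[(ref, mut)] += 1

    n_disease = sum(disease.values())
    n_neutral = sum(neutral.values())

    result = {}
    for sub in set(disease) | set(neutral):
        # Odds of this substitution versus all others, within each class.
        d_odds = (disease[sub] + pseudocount) / (n_disease - disease[sub] + pseudocount)
        n_odds = (neutral[sub] + pseudocount) / (n_neutral - neutral[sub] + pseudocount)
        result[sub] = math.log(d_odds / n_odds)
    return result


# Toy data (made up): D->G mostly disease-associated, D->E mostly neutral.
toy = (
    [("D", "G", "disease")] * 8 + [("D", "E", "disease")] * 2
    + [("D", "G", "neutral")] * 2 + [("D", "E", "neutral")] * 8
)
lor = log_odds_ratios(toy)
for sub, value in sorted(lor.items()):
    print(sub, round(value, 2))
```

On this toy input, D to G receives a positive value and D to E a negative value of equal magnitude, which would correspond to red and blue cells in a heat map like panel b.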

Mentions: Feature weights extracted from classifiers trained using the amino acid substitution features are visualized as a heat map in Fig. 3a. Here, each row shows feature weights obtained from one subset classifier, i.e. a classifier trained on one of the variant subsets. For example, the colors in the top row correspond to the weights obtained from the classifier trained on all variants with aspartic acid (D) as the reference amino acid. A positive weight (red) indicates that the feature (the mutant amino acid in this case) is predictive of disease association, whereas a negative weight (blue) indicates that it is predictive of neutral variants. The higher the absolute weight, the higher the feature importance. Using the top row as an example again, the low weight of the glutamic acid feature (column E) indicates that a substitution from aspartic acid to glutamic acid is relatively safe, whereas the high weight of the glycine feature (column G) indicates that a substitution from aspartic acid to glycine is relatively dangerous. Gray elements indicate amino acid substitutions that do not occur in our data set, since these require more than one mutation at the nucleotide level. Additionally, the feature weights obtained from the classifier that was trained on the entire data set are shown in the single row at the bottom.
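How signed weights of a linear classifier can be read as feature importance is sketched below for one variant subset (reference amino acid D). This is not the authors' implementation: it is a minimal stand-in that trains a linear SVM by stochastic subgradient descent on the L2-regularised hinge loss, using a one-hot encoding of the mutant amino acid. The names `one_hot` and `train_linear_svm`, the hyperparameters, and the toy data are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(mut_aa):
    """Encode the mutant amino acid as a 20-dimensional indicator vector."""
    return [1.0 if aa == mut_aa else 0.0 for aa in AMINO_ACIDS]


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent on the L2-regularised hinge loss.

    y holds +1 (disease-associated) or -1 (neutral). No bias term is used:
    with a one-hot encoding exactly one feature is active per sample, so a
    bias would be redundant in this toy setting.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Regularisation shrinks every weight slightly at each step.
            w = [wj * (1.0 - eta * lam) for wj in w]
            # The hinge loss contributes only when the margin is below 1.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


# Toy subset (made up): variants with D as the reference amino acid.
muts = ["G"] * 8 + ["E"] * 2 + ["G"] * 2 + ["E"] * 8
labels = [1] * 10 + [-1] * 10
X = [one_hot(m) for m in muts]

w = train_linear_svm(X, labels)
weights = dict(zip(AMINO_ACIDS, w))
print("G:", round(weights["G"], 2), "E:", round(weights["E"], 2))
```

In this toy run the G feature ends with a positive weight and the E feature with a negative one, mirroring the reading of the top row above, while mutant amino acids that never occur in the subset keep a weight of zero. Repeating the training once per reference amino acid would give one row of weights per subset classifier.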

