Insight into neutral and disease-associated human genetic variants through interpretable predictors.

van den Berg BA, Reinders MJ, de Ridder D, de Beer TA - PLoS ONE (2015)

Bottom Line: The high performances of current sequence-based predictors indicate that sequence data contains valuable information about a variant being neutral or disease-associated. However, most predictors do not readily disclose this information, and so it remains unclear what sequence properties are most important. Furthermore, for large sets of known variants, our approach can provide insight into the mechanisms responsible for variants being disease-associated.


Affiliation: Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands; Netherlands Bioinformatics Centre, Nijmegen, The Netherlands; Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands.

ABSTRACT
A variety of methods that predict human nonsynonymous single nucleotide polymorphisms (SNPs) to be neutral or disease-associated have been developed over the last decade. These methods are used for pinpointing disease-associated variants in the many variants obtained with next-generation sequencing technologies. The high performances of current sequence-based predictors indicate that sequence data contains valuable information about a variant being neutral or disease-associated. However, most predictors do not readily disclose this information, and so it remains unclear what sequence properties are most important. Here, we show how we can obtain insight into sequence characteristics of variants and their surroundings by interpreting predictors. We used an extensive range of features derived from the variant itself, its surrounding sequence, sequence conservation, and sequence annotation, and employed linear support vector machine classifiers to enable extracting feature importance from trained predictors. Our approach is useful for providing additional information about what features are most important for the predictions made. Furthermore, for large sets of known variants, it can provide insight into the mechanisms responsible for variants being disease-associated.



Fig. 1. Extracting feature weights from trained classifiers. a) For illustration, objects in two classes (blue and red) are represented by rectangles and characterized by the features "width" and "height". b) By measuring widths and heights, objects are mapped to a two-dimensional grid (feature space). Classifier training results in a decision boundary that separates the two classes of objects. c) Feature importance can be deduced from the slope of the decision boundary. Height is more important than width, hence the higher (absolute) weight for this feature. The sign indicates for which class the feature is predictive: blue rectangles are generally wider, hence the negative weight for the width feature; red rectangles are generally taller, hence the positive weight for the height feature.
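
The weight-extraction idea in this toy example can be sketched in a few lines of code. The following is a minimal illustration using Python and scikit-learn (our choice of tooling; the paper does not publish an implementation): synthetic wide-and-short "blue" and narrow-and-tall "red" rectangles are generated, a linear SVM is fitted, and the learned weights are read off the model.

# Minimal sketch of the Fig. 1 toy example (not from the paper):
# generate synthetic "rectangle" objects, fit a linear SVM with
# scikit-learn, and read the per-feature weights off the model.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Blue rectangles (label -1) are wide and short; red ones (label +1)
# are narrow and tall, with the height difference the more pronounced.
blue = np.column_stack([rng.normal(4.0, 0.5, 50),   # width
                        rng.normal(1.5, 0.5, 50)])  # height
red = np.column_stack([rng.normal(3.0, 0.5, 50),
                       rng.normal(4.5, 0.5, 50)])
X = np.vstack([blue, red])
y = np.array([-1] * 50 + [+1] * 50)

clf = LinearSVC(C=1.0).fit(X, y)
w_width, w_height = clf.coef_[0]
print(f"width weight:  {w_width:+.2f}")   # negative: predictive of blue
print(f"height weight: {w_height:+.2f}")  # positive: predictive of red

With this construction the width weight should come out negative and the height weight positive and larger in absolute value, mirroring panel c); the signs follow scikit-learn's convention that positive decision values map to the class with the larger label.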

Mentions: We used linear support vector machines, allowing us to extract feature weights from trained classifiers. A high weight indicates a strong contribution of a certain feature to the classifier outcome, and its sign indicates whether it is predictive for neutral (negative weight) or disease-associated (positive weight) variants (Fig. 1). To further enhance the interpretability and performance of the linear classifiers, we trained separate classifiers on subsets that contain variants with the same reference amino acid, based on the assumption that feature importance might differ per type of amino acid substitution. For example, a surrounding sequence with many small amino acids might pose a high risk for substitutions from small to large amino acids, whereas substitutions from small to other small amino acids in the same surroundings might pose a lower risk. Extracting feature importance from classifiers trained on the variant subsets could help reveal such differences. Although it is not the aim of this paper to introduce a competitive predictor, we demonstrate that classifiers can be made interpretable without significant loss in prediction performance.
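
As a concrete sketch of how such per-subset weights could be extracted, the snippet below trains one linear SVM per reference amino acid and lists the highest-weighted features for each subset. This is a hedged illustration, not the authors' code: the feature matrix X, label vector y (0 = neutral, 1 = disease-associated, so that positive weights point to disease-associated variants), reference-amino-acid array ref_aa, and feature_names list are all hypothetical placeholders.

import numpy as np
from sklearn.svm import LinearSVC

def weights_per_reference_aa(X, y, ref_aa, feature_names, top=5):
    # Train one linear SVM per reference amino acid and print the
    # features with the largest absolute weights for that subset.
    for aa in np.unique(ref_aa):
        mask = ref_aa == aa
        if len(np.unique(y[mask])) < 2:
            continue  # both classes are needed to fit a classifier
        clf = LinearSVC(C=1.0).fit(X[mask], y[mask])
        w = clf.coef_[0]  # one weight per feature
        print(f"reference amino acid {aa}:")
        for i in np.argsort(np.abs(w))[::-1][:top]:
            side = "disease-associated" if w[i] > 0 else "neutral"
            print(f"  {feature_names[i]}: {w[i]:+.3f} ({side})")

In practice one would standardize the features and cross-validate before comparing weight magnitudes: the weights of a linear SVM are only directly comparable when the features are on a common scale.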

