Automatic detection of exonic splicing enhancers (ESEs) using SVMs.

Mersch B, Gepperth A, Suhai S, Hotz-Wagenblatt A - BMC Bioinformatics (2008)

Bottom Line: Once the classifier is trained, every potential ESE sequence can be passed to the SVM for verification. Using SVMs with the combined oligo kernel yields a high accuracy of about 90 percent and well interpretable parameters. The best results were obtained using an SVM with the combined oligo kernel, while oligo kernels with oligomers of a certain length could be used to extract relevant features.


Affiliation: Department of Molecular Biophysics, German Cancer Research Center DKFZ, Im Neuenheimer Feld 580, Heidelberg, Germany. b.mersch@dkfz.de

ABSTRACT

Background: Exonic splicing enhancers (ESEs) activate nearby splice sites and promote the inclusion (vs. exclusion) of the exons in which they reside, and they serve as binding sites for SR proteins. To study the impact of ESEs on alternative splicing, it would be useful to be able to detect them in exons. Identifying SR protein-binding sites in human DNA sequences by machine learning techniques is a formidable task, since the exon sequences are also constrained by their functional role in coding for proteins.

Results: Choosing training examples for machine learning approaches is difficult, since only a few exact locations of human ESEs that could serve as positive examples are described in the literature. Additionally, it is unclear which sequences are suitable as negative examples. Therefore, we developed a motif-oriented data-extraction method that extracts exon sequences around experimentally or theoretically determined ESE patterns. Positive examples are restricted by heuristics based on known properties of ESEs, e.g. location in the vicinity of a splice site, whereas negative examples are taken in the same way from the middle of long exons. We show that a suitably chosen SVM using optimized sequence kernels (e.g., the combined oligo kernel) can extract meaningful properties from these training examples. Once the classifier is trained, every potential ESE sequence can be passed to the SVM for verification. Using SVMs with the combined oligo kernel yields a high accuracy of about 90 percent and readily interpretable parameters.
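The abstract names the oligo kernel and the combined oligo kernel but does not define them. As a rough illustration only, the following sketch assumes the position-smoothed form in which oligo kernels are commonly described: every pair of occurrences of the same oligomer in two sequences contributes a Gaussian of the distance between the two positions. The smoothing parameter sigma, the function names, and the choice to build the combined kernel as a plain sum over oligomer lengths 3 to 6 are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import defaultdict


def oligo_positions(seq, k):
    """Map each oligomer of length k to the list of its start positions in seq."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def oligo_kernel(s, t, k, sigma=1.0):
    """Position-smoothed oligomer similarity between sequences s and t.

    Assumed form: sum over shared oligomers and over all pairs of their
    positions (p in s, q in t) of exp(-(p - q)^2 / (4 * sigma^2)).
    Constant prefactors are omitted.
    """
    pos_s = oligo_positions(s, k)
    pos_t = oligo_positions(t, k)
    total = 0.0
    for oligo, ps in pos_s.items():
        for p in ps:
            for q in pos_t.get(oligo, ()):
                total += math.exp(-((p - q) ** 2) / (4.0 * sigma ** 2))
    return total


def combined_oligo_kernel(s, t, lengths=(3, 4, 5, 6), sigma=1.0):
    """Hypothetical combination: unweighted sum of oligo kernels over several lengths."""
    return sum(oligo_kernel(s, t, k, sigma) for k in lengths)
```

A small sigma makes the kernel sensitive to the exact position of an oligomer, while a large sigma approaches a position-independent count of shared oligomers. A kernel function of this kind could be used to fill a precomputed Gram matrix for any SVM implementation that accepts one.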

Conclusion: The motif-oriented data-extraction method seems to produce consistent training and test data leading to good classification rates and thus allows verification of potential ESE motifs. The best results were obtained using an SVM with the combined oligo kernel, while oligo kernels with oligomers of a certain length could be used to extract relevant features.


Figure 3: Image matrix of discriminative weight functions. The image was derived from the trained classifiers based on the trimer, tetramer, pentamer or hexamer kernel. Each of the lines shows the values of one specific weight function obtained from an average over 50 runs. Each of the 200 columns corresponds to a certain sequence position. By construction, the exonic splicing enhancer motif starts at position 100. For noise reduction all matrix elements below 0.25 have been zeroed.
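The two post-processing steps named in the caption, averaging over runs and zeroing small entries, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each run yields a matrix with one row per weight function and one column per sequence position, and it assumes that "below 0.25" refers to the absolute value, since the figure also shows negative weights. The function names are hypothetical.

```python
def average_runs(runs):
    """Element-wise mean of equally shaped matrices, each given as a list of rows."""
    n = len(runs)
    rows, cols = len(runs[0]), len(runs[0][0])
    return [
        [sum(run[r][c] for run in runs) / n for c in range(cols)]
        for r in range(rows)
    ]


def zero_small(matrix, threshold=0.25):
    """Set every entry whose absolute value is below threshold to 0.0.

    Assumption: the cut-off applies to the magnitude, so strongly negative
    weights are kept.
    """
    return [[v if abs(v) >= threshold else 0.0 for v in row] for row in matrix]
```

In this sketch the 50 per-run matrices would be passed to average_runs first and the result to zero_small, giving the matrix that is rendered as an image.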

Mentions: Figure 3 shows the positions in the sequences at which relevant features are located. High positive (negative) values correspond to features relevant for discriminating positive (negative) examples. Oligomers that were important for the classification were often located in the middle of the sequences. This is consistent, since the exonic splicing enhancers are, by construction, always located in the middle of the training data [see Methods], and these oligomers are contained in a large group of ESEs known as purine-rich enhancers, which contain repeated GAR (GAA or GAG) trinucleotides. Additionally, oligomers that were important for the classification of positive examples (red in Figure 3) are mostly composed of purines, with a higher proportion of adenine. In the negative examples (blue in Figure 3) this is inverted and guanine is more frequent. This agrees with the fact that the most frequent middle motifs in negative examples were GGAGGA or GAGGAG, whereas in positive examples GAAGAA or AAGAAG were most frequent. To check whether the classifier simply relied on these differences, we examined the classification performance using only the frequencies of the hexamers in the middle of the sequence [see Methods]. Using only this information we obtained a classification rate of 66.8%, showing that other features must play an important role for classification as well.
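A baseline that uses only the middle hexamer could look like the sketch below. The actual procedure is described in the paper's Methods section and is not reproduced here. This example assumes a simple rule: a sequence is labelled positive if its middle hexamer is relatively more frequent among the positive training sequences than among the negative ones. The 0-based start index of 100 and all function names are assumptions.

```python
from collections import Counter


def middle_hexamer(seq, start=100, k=6):
    """Return the k-mer beginning at the (assumed 0-based) motif start position."""
    return seq[start:start + k]


def train_counts(seqs, start=100):
    """Count how often each middle hexamer occurs in a set of training sequences."""
    return Counter(middle_hexamer(s, start) for s in seqs)


def classify(seq, pos_counts, neg_counts, start=100):
    """Return +1 if the middle hexamer is relatively more frequent among
    positive examples, otherwise -1 (ties and unseen hexamers give -1)."""
    hexamer = middle_hexamer(seq, start)
    pos_total = sum(pos_counts.values()) or 1
    neg_total = sum(neg_counts.values()) or 1
    pos_freq = pos_counts[hexamer] / pos_total
    neg_freq = neg_counts[hexamer] / neg_total
    return 1 if pos_freq > neg_freq else -1
```

With made-up toy sequences such as "CTCTGAAGAACTCT" (positive) and "CTCTGGAGGACTCT" (negative), and start=4, a query whose middle hexamer is GAAGAA is labelled +1 and one with GGAGGA is labelled -1. A rule of this kind ignores everything outside the motif, which is why a rate well below that of the full SVM points to informative features in the flanking sequence.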

