Limits...
Improving the Caenorhabditis elegans genome annotation using machine learning.

Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer RJ, Schölkopf B - PLoS Comput. Biol. (2006)

Bottom Line: We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions.The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter.We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

View Article: PubMed Central - PubMed

Affiliation: Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany. Gunnar.Raetsch@tuebingen.mpg.de

ABSTRACT
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Show MeSH

Related in: MedlinePlus

Given the Start of the First and the End of the Last Exon, Our System (mSplicer) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVMGY) and Acceptor (SVMAG) Splice SitesThe detectors assign a score to each candidate site, shown below the sequence. In combination with additional information including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, lengths of segments, as well as properties of the segments to the score are adjusted during training. They are optimized such that the margin between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of them is shown in red) is maximized. Prediction of new sequences works by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to decoding generalized HMMs 12, which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).
© Copyright Policy
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC1808025&req=5

pcbi-0030020-g003: Given the Start of the First and the End of the Last Exon, Our System (mSplicer) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVMGY) and Acceptor (SVMAG) Splice SitesThe detectors assign a score to each candidate site, shown below the sequence. In combination with additional information including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, lengths of segments, as well as properties of the segments to the score are adjusted during training. They are optimized such that the margin between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of them is shown in red) is maximized. Prediction of new sequences works by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to decoding generalized HMMs 12, which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).

Mentions: SVMs have their mathematical foundations in a statistical theory of learning and attempt to discriminate two classes by separating them with a large margin (“margin maximization”). SVMs are trained by solving an optimization problem (Figure 1) involving labeled training examples—true splice sites (positive) and decoys (negative). They employ similarity measures referred to as kernels that are designed for the classification task at hand. In our case, the kernels compare pairs of sequences in terms of their matching substring motifs [9,11,12] as illustrated in Figure 2 (cf. Material and Methods for more details). The idea of our algorithm is to first scan the unspliced mRNA using the SVM-based splice site detectors. In a second step, their predictions are combined to form the overall splicing prediction (cf. Figure 3 as well as Materials and Methods for details). This is implemented using a state-based system similar to standard hidden Markov model (HMM)–based gene-finding approaches [13–18]. We consider two different models: the simpler model implements the general rule that the start of the sequence is followed by a number (≥0) of donor and acceptor splice site pairs (5′ and 3′ ends of the intron) before the sequence ends (cf. Figure 4). If, moreover, one assumes the start and end of the coding region to be given, one can exploit that the spliced sequence consists of a string of non-stop codons terminated by a stop codon (TAA, TAG, TGA). In this case, the sum of the lengths of the coding parts of exons is divisible by three and the sequence does not contain in-frame stop codons. This can be translated into an alternative, more sophisticated model (cf. Figure 5) that is expected to perform better on coding regions, and may provide false predictions otherwise. The simpler model, on the other hand, is also applicable to untranslated regions (UTR); if in doubt, one should thus resort to this model.


Improving the Caenorhabditis elegans genome annotation using machine learning.

Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer RJ, Schölkopf B - PLoS Comput. Biol. (2006)

Given the Start of the First and the End of the Last Exon, Our System (mSplicer) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVMGY) and Acceptor (SVMAG) Splice SitesThe detectors assign a score to each candidate site, shown below the sequence. In combination with additional information including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, lengths of segments, as well as properties of the segments to the score are adjusted during training. They are optimized such that the margin between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of them is shown in red) is maximized. Prediction of new sequences works by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to decoding generalized HMMs 12, which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).
© Copyright Policy
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC1808025&req=5

pcbi-0030020-g003: Given the Start of the First and the End of the Last Exon, Our System (mSplicer) First Scans the Sequence Using SVM Detectors Trained To Recognize Donor (SVMGY) and Acceptor (SVMAG) Splice SitesThe detectors assign a score to each candidate site, shown below the sequence. In combination with additional information including outputs of SVMs recognizing exon/intron content, and scores for exon/intron lengths (unpublished data), these splice site scores contribute to the cumulative score for a putative splicing isoform. The bottom graph (step 2) illustrates the computation of the cumulative scores for two splicing isoforms, where the score at end of the sequence is the final score of the isoform. The contributions of the individual detector outputs, lengths of segments, as well as properties of the segments to the score are adjusted during training. They are optimized such that the margin between the true splicing isoform (shown in blue) and all other (wrong) isoforms (one of them is shown in red) is maximized. Prediction of new sequences works by selecting the splicing isoform with the maximum cumulative score. This can be implemented using dynamic programming related to decoding generalized HMMs 12, which also allows one to enforce certain constraints on the isoform (e.g., an open reading frame).
Mentions: SVMs have their mathematical foundations in a statistical theory of learning and attempt to discriminate two classes by separating them with a large margin (“margin maximization”). SVMs are trained by solving an optimization problem (Figure 1) involving labeled training examples—true splice sites (positive) and decoys (negative). They employ similarity measures referred to as kernels that are designed for the classification task at hand. In our case, the kernels compare pairs of sequences in terms of their matching substring motifs [9,11,12] as illustrated in Figure 2 (cf. Material and Methods for more details). The idea of our algorithm is to first scan the unspliced mRNA using the SVM-based splice site detectors. In a second step, their predictions are combined to form the overall splicing prediction (cf. Figure 3 as well as Materials and Methods for details). This is implemented using a state-based system similar to standard hidden Markov model (HMM)–based gene-finding approaches [13–18]. We consider two different models: the simpler model implements the general rule that the start of the sequence is followed by a number (≥0) of donor and acceptor splice site pairs (5′ and 3′ ends of the intron) before the sequence ends (cf. Figure 4). If, moreover, one assumes the start and end of the coding region to be given, one can exploit that the spliced sequence consists of a string of non-stop codons terminated by a stop codon (TAA, TAG, TGA). In this case, the sum of the lengths of the coding parts of exons is divisible by three and the sequence does not contain in-frame stop codons. This can be translated into an alternative, more sophisticated model (cf. Figure 5) that is expected to perform better on coding regions, and may provide false predictions otherwise. The simpler model, on the other hand, is also applicable to untranslated regions (UTR); if in doubt, one should thus resort to this model.

Bottom Line: We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions.The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter.We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

View Article: PubMed Central - PubMed

Affiliation: Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany. Gunnar.Raetsch@tuebingen.mpg.de

ABSTRACT
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Show MeSH
Related in: MedlinePlus