Improving the Caenorhabditis elegans genome annotation using machine learning.

Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer RJ, Schölkopf B - PLoS Comput. Biol. (2006)

Bottom Line: We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Affiliation: Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany. Gunnar.Raetsch@tuebingen.mpg.de

ABSTRACT
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the WormBase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

Figure 2 (pcbi-0030020-g002): Given Two Sequences, s1 and s2, of Equal Length, Our Kernel Consists of a Weighted Sum to Which Each Match in the Sequences Makes a Contribution w_l Depending on Its Length l, Where Longer Matches Contribute More Significantly. For predictions, we use a window of 140 nt around the potential splice site (cf. Materials and Methods for details, including the procedure of how the length of the window is determined).
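
The kernel described in this caption can be illustrated with a minimal Python sketch. It assumes that a "match" is a maximal run of positions at which the two equal-length sequences agree. The weight function `block_weight` and its `max_degree` parameter are hypothetical choices made for illustration only (the number of substrings of length at most `max_degree` inside a run); the paper's actual weights w_l are given in its Materials and Methods, not here.

```python
from typing import Callable, List


def block_weight(length: int, max_degree: int = 3) -> float:
    """Illustrative weight w_l for a matching run of the given length.

    Assumption (not from the source): w_l counts the substrings of
    length 1..max_degree contained in the run, so longer runs weigh more.
    """
    return float(sum(length - d + 1
                     for d in range(1, min(length, max_degree) + 1)))


def match_blocks(s1: str, s2: str) -> List[int]:
    """Return the lengths of all maximal runs of matching positions."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    blocks: List[int] = []
    run = 0
    for a, b in zip(s1, s2):
        if a == b:
            run += 1
        else:
            if run:
                blocks.append(run)
            run = 0
    if run:
        blocks.append(run)
    return blocks


def match_kernel(s1: str, s2: str,
                 weight: Callable[[int], float] = block_weight) -> float:
    """Weighted sum over matching runs: each run of length l adds weight(l)."""
    return sum(weight(length) for length in match_blocks(s1, s2))
```

With these assumed weights, one run of length 4 contributes more than two runs of length 2, which is the qualitative behavior the caption describes. In the setting of the caption, `s1` and `s2` would be two 140-nt windows around candidate splice sites.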

Mentions: SVMs have their mathematical foundations in a statistical theory of learning and attempt to discriminate two classes by separating them with a large margin (“margin maximization”). SVMs are trained by solving an optimization problem (Figure 1) involving labeled training examples: true splice sites (positive) and decoys (negative). They employ similarity measures referred to as kernels that are designed for the classification task at hand. In our case, the kernels compare pairs of sequences in terms of their matching substring motifs [9,11,12] as illustrated in Figure 2 (cf. Materials and Methods for more details). The idea of our algorithm is to first scan the unspliced mRNA using the SVM-based splice site detectors. In a second step, their predictions are combined to form the overall splicing prediction (cf. Figure 3 as well as Materials and Methods for details). This is implemented using a state-based system similar to standard hidden Markov model (HMM)–based gene-finding approaches [13–18]. We consider two different models. The simpler model implements the general rule that the start of the sequence is followed by a number (≥0) of donor and acceptor splice site pairs (5′ and 3′ ends of the intron) before the sequence ends (cf. Figure 4). If, moreover, one assumes the start and end of the coding region to be given, one can exploit the fact that the spliced sequence consists of a string of non-stop codons terminated by a stop codon (TAA, TAG, TGA). In this case, the sum of the lengths of the coding parts of exons is divisible by three and the sequence does not contain in-frame stop codons. This can be translated into an alternative, more sophisticated model (cf. Figure 5) that is expected to perform better on coding regions but may give false predictions otherwise. The simpler model, on the other hand, is also applicable to untranslated regions (UTR); if in doubt, one should thus resort to this model.
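
The coding-region constraint used by the more sophisticated model can be sketched as a simple check in Python. This is not the state-based system of the paper; it only tests whether a candidate set of introns yields a spliced sequence whose length is divisible by three and which has no in-frame stop codon before the final one. The function names and the 0-based, half-open `(start, end)` intron coordinates are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(pre_mrna: str, introns: Iterable[Tuple[int, int]]) -> str:
    """Remove introns given as 0-based, half-open (start, end) intervals."""
    pieces: List[str] = []
    pos = 0
    for start, end in sorted(introns):
        if start < pos or end <= start or end > len(pre_mrna):
            raise ValueError("introns must be non-overlapping and in range")
        pieces.append(pre_mrna[pos:start])  # exon part before this intron
        pos = end
    pieces.append(pre_mrna[pos:])           # last exon part
    return "".join(pieces)


def is_valid_coding_sequence(cds: str) -> bool:
    """True if cds is a string of non-stop codons ended by one stop codon."""
    if len(cds) == 0 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[:-1])


# Toy example: two exon parts separated by a 12-nt intron at positions 6..17.
pre = "ATGGCC" + "GTGTAATTTCAG" + "AAATAA"
correct = splice(pre, [(6, 18)])   # removes the whole intron
shifted = splice(pre, [(6, 17)])   # acceptor placed one base too early
```

In this toy sequence, `is_valid_coding_sequence(correct)` holds, while the unspliced sequence fails because of an in-frame TAA and `shifted` fails because its length is not divisible by three. A check of this kind only makes sense when the start and end of the coding region are known, which is why the simpler model remains the safer choice for untranslated regions.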

