Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine.

Mao R, Raj Kumar PK, Guo C, Zhang Y, Liang C - PLoS ONE (2014)

Bottom Line: We demonstrated that our proposed feature extraction approach classified RIs from CSIs more accurately than four other approaches. Not only were the basic sequence features and positional distribution characteristics of RIs obtained, but putative regulatory motifs in intron splicing were also predicted based on our feature extraction approach. Our study should facilitate a better understanding of the mechanisms underlying intron retention.


Affiliation: College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi, China; College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China; Department of Biology, Miami University, Oxford, Ohio, United States of America.

ABSTRACT
Alternative splicing is one of the important modes of pre-mRNA post-transcriptional modification. It allows many distinct mature mRNA transcripts to be created from a single gene by using different splice sites. In plants such as Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and on the regulation of their expression, while little systematic classification of RIs from constitutively spliced introns (CSIs) has been conducted using machine learning approaches. We used random forest and support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. By comparing the coordinates of introns of all annotated mRNAs from TAIR10, we obtained our high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and finally extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach classified RIs from CSIs more accurately than four other approaches. The optimal penalty parameter C and the RBF kernel parameter of the SVM were set using a particle swarm optimization algorithm (PSOSVM). Our classifiers achieved F-measures of 80.8% (random forest) and 77.4% (PSOSVM). Not only were the basic sequence features and positional distribution characteristics of RIs obtained, but putative regulatory motifs in intron splicing were also predicted based on our feature extraction approach. Our study should facilitate a better understanding of the mechanisms underlying intron retention.
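The F-measure reported above combines precision and recall into a single score. The abstract does not state which variant was used, so the following is only a minimal sketch that assumes the common balanced form (F1) computed for the RI class; the function name `f_measure` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-beta score from confusion-matrix counts.

    tp: RIs correctly predicted as RIs (true positives)
    fp: CSIs wrongly predicted as RIs (false positives)
    fn: RIs wrongly predicted as CSIs (false negatives)
    beta = 1.0 gives the balanced F1 score (an assumption here).
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only, not results from the paper.
example_score = f_measure(tp=80, fp=20, fn=20)
print(f"F-measure: {example_score:.3f}")
```

Because the score ignores true negatives, it is less easily inflated by a large majority class than plain accuracy, which may matter when one class of introns greatly outnumbers the other.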


pone-0104049-g003: Numbers of various RNA types annotated in TAIR10 gene annotation for Arabidopsis. Each horizontal bar (with the number) indicates the number for a given RNA type.

Mentions: In the TAIR10 gene annotation for Arabidopsis, there are 28,775 genes, 3,903 transposable element genes and 924 pseudogenes. All of these genes except the pseudogenes were used for further analysis. They have a total of 40,745 annotated RNAs, which can be categorized into 8 different RNA types (Figure 3). Figure 3 shows that most of the annotated RNAs are mRNAs (86.85%, 35,386 out of 40,745). As shown in Table 2, we found a total of 2,811 RIs and 113,098 CSIs in Arabidopsis. Interestingly, no RI was detected in the chloroplast (ChrC) or mitochondrion (ChrM) genomes, and only 42 CSIs were found in these organelle genomes. Across the 8 RNA types, both RIs (98.26%, 2,762 out of 2,811) and CSIs (97.53%, 110,304 out of 113,098) were detected overwhelmingly in mRNAs, whereas they were rarely found among the other 7 RNA types (RIs: 1.74%, 49 out of 2,811; CSIs: 2.47%, 2,794 out of 113,098). Therefore, all the RIs (2,762) and CSIs (110,262 = 110,304 − 42) detected in mRNAs within chromosomes Chr1–Chr5 constituted our data set for downstream analysis.
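The percentages and the final data set size quoted above follow directly from the stated counts. As a quick check, the arithmetic can be reproduced as below; all numbers come from the paragraph itself, while the variable names and the helper `pct` are illustrative.

```python
# Counts quoted in the paragraph above (TAIR10 annotation).
total_rnas, mrnas = 40_745, 35_386
ri_total, ri_in_mrna = 2_811, 2_762
csi_total, csi_in_mrna = 113_098, 110_304
organelle_csi = 42  # CSIs found in ChrC and ChrM


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of mRNAs among all annotated RNAs.
mrna_share = pct(mrnas, total_rnas)

# Share of RIs and CSIs found in mRNAs versus the other 7 RNA types.
ri_mrna_share = pct(ri_in_mrna, ri_total)
ri_other_share = pct(ri_total - ri_in_mrna, ri_total)
csi_mrna_share = pct(csi_in_mrna, csi_total)
csi_other_share = pct(csi_total - csi_in_mrna, csi_total)

# Final data set: mRNA introns on Chr1-Chr5 only (organelle CSIs removed).
final_ri = ri_in_mrna
final_csi = csi_in_mrna - organelle_csi

print(mrna_share, ri_mrna_share, ri_other_share, csi_mrna_share, csi_other_share)
print(final_ri, final_csi)
```

The resulting data set contains roughly 40 CSIs for every RI, so the two classes are strongly imbalanced.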

