Prediction of mature microRNA and piwi-interacting RNA without a genome reference or precursors.

Menor MS, Baek K, Poisson G - Int J Mol Sci (2015)

Bottom Line: In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. We found that the usage of an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. Based on the analysis of a hold-out test set, our approach can increase the true positive rate by up to 60% compared to the existing piRNA predictor.


Affiliation: Department of Information and Computer Sciences, University of Hawaii at Mānoa, 1680 East-West Road, Honolulu, HI 96822, USA. mmenor@hawaii.edu.

ABSTRACT
The discovery of novel microRNA (miRNA) and piwi-interacting RNA (piRNA) is important for understanding many biological processes. Most of the available miRNA and piRNA identification methods depend on the availability of the organism's genome sequence and the quality of its annotation. Therefore, an efficient prediction method based solely on the short RNA reads and requiring no genomic information is highly desirable. In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. Using an empirical Bayesian kernel method and the error correcting output codes framework, compact models suitable for large-scale analyses are built on databases of known mature miRNAs and piRNAs. We found that the usage of an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. Based on the analysis of a hold-out test set, our approach can increase the true positive rate by up to 60% compared to the existing piRNA predictor. Using experimental data, we also show that our approach can detect about an order of magnitude more known miRNAs than the mature miRNA predictor, miRPlex.
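The contrast between the L1-based and L2-based Gaussian kernels can be illustrated with a minimal sketch. The exact kernel form used in the study is not given on this page, so the version below is an assumption: it keeps the Gaussian shape exp(-gamma * d^2) and only swaps the distance d between the L1 (Manhattan) and L2 (Euclidean) norms. The function name `gaussian_kernel` and the parameter `gamma` are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma=1.0, norm="l2"):
    """Gaussian-shaped kernel exp(-gamma * d(x, y)**2).

    Assumption: the "L1-based" variant replaces the Euclidean distance
    with the Manhattan distance inside the same exponential form.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if norm == "l1":
        dist = sum(diffs)                          # Manhattan distance
    elif norm == "l2":
        dist = math.sqrt(sum(d * d for d in diffs))  # Euclidean distance
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return math.exp(-gamma * dist ** 2)


# Two toy composition vectors that differ in every coordinate.
a = [0.0, 0.0]
b = [1.0, 1.0]
k_l1 = gaussian_kernel(a, b, norm="l1")  # uses d = 2
k_l2 = gaussian_kernel(a, b, norm="l2")  # uses d = sqrt(2)
```

Because the L1 distance is never smaller than the L2 distance, this L1-based variant gives a similarity no higher than the L2-based one for the same `gamma`, which changes how sharply the kernel separates reads with differing composition.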


Figure 5 (ijms-16-01466-f005): Test set ROC curves for all-pairs (AP) and one vs. rest (OVR) McRUMs and for two piRNApredictors based on Fisher’s linear discriminant (FLD): the original piRNApredictor (FLD1) and a retrained piRNApredictor (FLD2). The ROC curves are generated from the observed FPR and TPR as the posterior probability threshold varies from 0.30 to 0.99 in increments of 0.01.
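The threshold sweep described in the caption can be sketched as follows. This is an illustration, not the authors' code: it assumes a read is called piRNA when its posterior probability is at least the threshold, and the function name `roc_points` and the toy inputs are hypothetical.

```python
def roc_points(posteriors, labels, start=0.30, stop=0.99, step=0.01):
    """Return (threshold, FPR, TPR) for each threshold in the sweep.

    posteriors: predicted posterior probability of the positive class.
    labels: 1 for a true positive-class read, 0 otherwise.
    Assumption: a read is predicted positive when posterior >= threshold.
    """
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Build thresholds from integer steps to avoid floating-point drift.
    n_steps = round((stop - start) / step)
    thresholds = [round(start + i * step, 2) for i in range(n_steps + 1)]

    points = []
    for t in thresholds:
        tp = sum(1 for p, label in zip(posteriors, labels) if label and p >= t)
        fp = sum(1 for p, label in zip(posteriors, labels) if not label and p >= t)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


# Toy data: two positives and two negatives.
points = roc_points([0.95, 0.40, 0.35, 0.10], [1, 1, 0, 0])
```

Sweeping from 0.30 to 0.99 in steps of 0.01 yields 70 operating points per classifier; plotting FPR against TPR for those points gives one curve of the kind shown in the figure.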

The predictive performance of McRUMs for the piRNA case, compared to the software piRNApredictor [15], is given in Figure 5. piRNApredictor uses a Fisher’s linear discriminant (FLD) classifier and makes predictions based on k-mer composition. Two variations of piRNApredictor are examined: the original provided by its authors and one built on our training dataset. Both McRUMs significantly outperform the piRNApredictors. One reason for the low performance of piRNApredictor is that FLD models the decision boundary between piRNA and other ncRNAs as linear. We also observed poor performance using a linear SVM in our preliminary analysis [16]. That analysis also noted the small representation of ncRNA sequences under 25 nt long in piRNApredictor’s training dataset: only 4.67% of its sequences were ncRNAs under 25 nt long. This weak representation of short ncRNAs may be the reason why the piRNApredictor trained on our dataset outperforms the original.
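To make "k-mer composition" concrete, the sketch below turns a short RNA read into a vector of k-mer frequencies. It is a generic illustration and not the feature extraction of piRNApredictor or of the McRUM models; the function name `kmer_composition`, the choice to map T to U, and the handling of ambiguous bases are assumptions.

```python
from itertools import product


def kmer_composition(seq, k, alphabet="ACGU"):
    """Return the relative frequency of every k-mer over `alphabet`.

    The output has len(alphabet)**k entries, ordered lexicographically
    by k-mer. Windows containing symbols outside the alphabet (e.g. N)
    are skipped, which is an assumption of this sketch.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style reads as well
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Normalise so reads of different lengths are comparable.
    return [counts[w] / total if total else 0.0 for w in kmers]


mono = kmer_composition("ACGU", 1)    # one entry per nucleotide
di = kmer_composition("ACGUACGU", 2)  # 16 dinucleotide frequencies
```

A vector like this can be fed to any classifier; a linear model such as FLD can only separate classes with a hyperplane in this feature space, which is the limitation the passage above points to.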

