Prediction of mature microRNA and piwi-interacting RNA without a genome reference or precursors.

Menor MS, Baek K, Poisson G - Int J Mol Sci (2015)

Bottom Line: In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. We found that using an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. On a hold-out test set, our approach can increase the true positive rate by up to 60% compared to the existing piRNA predictor.

Affiliation: Department of Information and Computer Sciences, University of Hawaii at Mānoa, 1680 East-West Road, Honolulu, HI 96822, USA. mmenor@hawaii.edu.

ABSTRACT
The discovery of novel microRNA (miRNA) and piwi-interacting RNA (piRNA) is an important task for the understanding of many biological processes. Most of the available miRNA and piRNA identification methods depend on the availability of the organism's genome sequence and the quality of its annotation. Therefore, an efficient prediction method based solely on the short RNA reads and requiring no genomic information is highly desirable. In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. Using an empirical Bayesian kernel method and the error correcting output codes framework, compact models suitable for large-scale analyses are built on databases of known mature miRNAs and piRNAs. We found that using an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. On a hold-out test set, our approach can increase the true positive rate by up to 60% compared to the existing piRNA predictor. Using experimental data, we also show that our approach can detect about an order of magnitude more known miRNAs than the mature miRNA predictor, miRPlex.

Figure 2 (ijms-16-01466-f002): Three-fold cross-validation ROC curves for CFS McRUMs using L1 and L2 Gaussian kernels with all-pairs (AP) or one vs. rest (OVR) decompositions. The ROC curves are generated from the observed false positive rate (FPR) and true positive rate (TPR) under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01. (a) Results for microRNAs; (b) Results for piwi-interacting RNAs; (c) Results for other types of small RNAs.
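
The threshold sweep described in the caption can be sketched roughly as follows. This is only an illustration, not the authors' code: the function name `roc_points`, its parameters, and the rule that a read counts as a positive call when its posterior probability for the class meets the threshold are all assumptions made for the example.

```python
def roc_points(posteriors, labels, start=0.30, stop=0.99, step=0.01):
    """Return (threshold, FPR, TPR) tuples for a sweep of posterior thresholds.

    posteriors: predicted posterior probability of the positive class per read
    labels:     True if the read really belongs to the positive class
    All names here are hypothetical; the decision rule is an assumption.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Count steps with integers so floating-point drift cannot drop a threshold.
    n_steps = round((stop - start) / step)
    points = []
    for i in range(n_steps + 1):
        # Rounding to two decimals matches the 0.01 increment used here.
        t = round(start + i * step, 2)
        tp = sum(1 for p, y in zip(posteriors, labels) if y and p >= t)
        fp = sum(1 for p, y in zip(posteriors, labels) if not y and p >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    demo = roc_points([0.95, 0.60, 0.40, 0.20], [True, True, False, False])
    for t, fpr, tpr in demo[:3]:
        print(t, fpr, tpr)
```

With the default arguments the sweep visits 70 thresholds (0.30, 0.31, ..., 0.99), and each one contributes a single (FPR, TPR) point to the curve.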

We compare the use of L1 and the standard L2 Gaussian kernels with three-fold cross-validation using CFS-selected features. Figure 2 shows a marked increase in performance from using the L1 rather than the standard L2 Gaussian kernel. For example, for piRNA prediction at a 0.01 false positive rate (FPR), the true positive rate (TPR) rises from about 0.4 with L2 to about 0.9 with L1 (Figure 2b). Aggarwal et al. [19] studied the behavior of distance metrics in high-dimensional clustering, and their analysis explains the results we observe. They define a concept called the relative contrast, which measures how well one can discriminate between the nearest and furthest neighbor under a particular distance metric. If the relative contrast is too low, all points appear to be equally close, which would be disastrous in classification, where the locality of data points is crucial. They go on to show that the relative contrast under the L1 distance decays more slowly than under the L2 distance as the dimension increases, and therefore the L1 distance remains a viable metric at larger dimensions than the L2 distance. These theoretical results agree with our observation that the standard L2 Gaussian kernel evaluations in our problem are all near one, indicating low relative contrast: all points in the dataset appear to be equally distant from (or similar to) each other. The increased relative contrast offered by the L1 distance allows McRUM to distinguish data points belonging to different classes more easily and thus improves predictive performance.
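
To make the two quantities in this argument concrete, the following is a minimal sketch, not the McRUM implementation. It assumes a Gaussian kernel of the form exp(-d(x, y)^2 / (2 sigma^2)), where d is either the L1 or the L2 distance, and it assumes the relative contrast is computed as (D_max - D_min) / D_min over the distances from a query point. The exact kernel parameterization used in the paper is not stated in this excerpt, and all function names are hypothetical.

```python
import math


def l1_distance(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_kernel(x, y, distance, sigma=1.0):
    """Gaussian kernel built on the given distance (assumed form, see lead-in)."""
    d = distance(x, y)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def relative_contrast(query, points, distance):
    """(D_max - D_min) / D_min for the distances from query to each point."""
    dists = [distance(query, p) for p in points]
    d_min, d_max = min(dists), max(dists)
    if d_min == 0:
        raise ValueError("nearest point coincides with the query")
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    query = [0.0, 0.0]
    points = [[1.0, 0.0], [1.0, 1.0]]  # toy data, for illustration only
    print("L1 contrast:", relative_contrast(query, points, l1_distance))
    print("L2 contrast:", relative_contrast(query, points, l2_distance))
```

For this toy query and point set, the L1 distances are 1 and 2, giving a relative contrast of 1.0, while the L2 distances are 1 and sqrt(2), giving about 0.414. Because the L1 distance is never smaller than the L2 distance, the L1-based kernel value is never larger than the L2-based one at the same sigma, so under this assumed form it tends to sit further from one.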

