Prediction of mature microRNA and piwi-interacting RNA without a genome reference or precursors.

Menor MS, Baek K, Poisson G - Int J Mol Sci (2015)

Bottom Line: In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. We found that the usage of an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. Our approach can increase the true positive rate by up to 60% compared to the existing piRNA predictor, based on the analysis of a hold-out test set.


Affiliation: Department of Information and Computer Sciences, University of Hawaii at Mānoa, 1680 East-West Road, Honolulu, HI 96822, USA. mmenor@hawaii.edu.

ABSTRACT
The discovery of novel microRNA (miRNA) and piwi-interacting RNA (piRNA) is an important task for the understanding of many biological processes. Most of the available miRNA and piRNA identification methods depend on the availability of the organism's genome sequence and the quality of its annotation. Therefore, an efficient prediction method based solely on the short RNA reads and requiring no genomic information is highly desirable. In this study, we propose an approach that relies primarily on the nucleotide composition of the read and does not require reference genomes of related species for prediction. Using an empirical Bayesian kernel method and the error correcting output codes framework, compact models suitable for large-scale analyses are built on databases of known mature miRNAs and piRNAs. We found that the usage of an L1-based Gaussian kernel can double the true positive rate compared to the standard L2-based Gaussian kernel. Our approach can increase the true positive rate by up to 60% compared to the existing piRNA predictor, based on the analysis of a hold-out test set. Using experimental data, we also show that our approach can detect about an order of magnitude more known miRNAs than the mature miRNA predictor miRPlex.
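The kernel comparison in the abstract can be illustrated with a minimal sketch. The abstract does not give the kernel formula, so the functional form below (a Gaussian in which the Euclidean distance is replaced by the Manhattan distance), the bandwidth `sigma`, and the helper `nucleotide_composition` are assumptions made for illustration only; the paper's actual feature set and kernel definition may differ.

```python
import math


def nucleotide_composition(read, alphabet="ACGU"):
    """Fraction of each nucleotide in a read (hypothetical feature vector)."""
    read = read.upper().replace("T", "U")
    return [read.count(n) / len(read) for n in alphabet]


def gaussian_kernel_l2(x, y, sigma=1.0):
    """Standard Gaussian kernel based on the squared Euclidean (L2) distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gaussian_kernel_l1(x, y, sigma=1.0):
    """Gaussian-shaped kernel using the Manhattan (L1) distance (assumed form)."""
    l1_dist = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-(l1_dist ** 2) / (2.0 * sigma ** 2))


# Two arbitrary example reads; any short RNA sequences would do.
read_a = "UGAGGUAGUAGGUUGUAUAGUU"
read_b = "UAGCUUAUCAGACUGAUGUUGA"

x = nucleotide_composition(read_a)
y = nucleotide_composition(read_b)

k_l2 = gaussian_kernel_l2(x, y, sigma=0.5)
k_l1 = gaussian_kernel_l1(x, y, sigma=0.5)
print(f"L2 Gaussian kernel: {k_l2:.4f}")
print(f"L1 Gaussian kernel: {k_l1:.4f}")
```

Because the L1 distance between two vectors is never smaller than their L2 distance, this L1 variant never assigns a higher similarity than the L2 kernel at the same bandwidth, so in practice the two would be tuned with different values of `sigma`.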

Figure 1: Three-fold cross-validation receiver operating characteristic (ROC) curves for multiclass relevance units machine (McRUM) models using the L1 Gaussian kernel, trained on either correlation-based feature selection (CFS)-selected features or all features (ALL), with all-pairs (AP) or one vs. rest (OVR) decompositions. The ROC curves are generated from the observed false positive rate (FPR) and true positive rate (TPR) under varying posterior probability thresholds from 0.30 to 0.99 in increments of 0.01. (a) Results for microRNAs; (b) Results for piwi-interacting RNAs; (c) Results for other types of small RNAs.
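The threshold sweep described in the caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `roc_points`, the toy posteriors and labels, and the rule that a read is called positive when its posterior probability for the class is at least the threshold are all assumptions.

```python
def roc_points(posteriors, is_positive, thresholds=None):
    """Return one (FPR, TPR) pair per posterior probability threshold.

    posteriors  -- predicted posterior probability of the class, per read
    is_positive -- True if the read truly belongs to the class
    """
    if thresholds is None:
        # 0.30, 0.31, ..., 0.99 as in the caption (70 thresholds).
        thresholds = [t / 100 for t in range(30, 100)]

    n_pos = sum(1 for y in is_positive if y)
    n_neg = len(is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    points = []
    for thr in thresholds:
        # Assumed decision rule: call positive when posterior >= threshold.
        tp = sum(1 for p, y in zip(posteriors, is_positive) if y and p >= thr)
        fp = sum(1 for p, y in zip(posteriors, is_positive) if not y and p >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data, invented for illustration only.
toy_posteriors = [0.95, 0.80, 0.60, 0.40, 0.35, 0.20]
toy_labels = [True, True, False, True, False, False]

points = roc_points(toy_posteriors, toy_labels)
print(len(points), "threshold settings")
print("lowest threshold (FPR, TPR):", points[0])
print("highest threshold (FPR, TPR):", points[-1])
```

Raising the threshold trades sensitivity for specificity, which is why each curve in the figure is traced from a high-TPR, high-FPR point toward the origin.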

Mentions: Figure 1 illustrates the predictive performance of McRUMs with the L1 Gaussian kernel using all features or CFS-selected features. Two popular decompositions of a multiclass problem into a set of binary class problems are considered: all-pairs (AP) and one vs. rest (OVR). In the AP decomposition, a binary classifier is created for every pair of classes: (1) miRNA vs. piRNA; (2) miRNA vs. other ncRNA; and (3) piRNA vs. other ncRNA. The OVR decomposition, by contrast, creates a binary classifier for each class against all others: (1) miRNA vs. piRNA and other ncRNA; (2) piRNA vs. miRNA and other ncRNA; and (3) other ncRNA vs. miRNA and piRNA. Performance is measured using three-fold cross-validation. Figure 1b shows some advantages of using the OVR decomposition over AP. However, there is little difference between the models that use all features and those that use CFS-selected features. This is surprising: the order-of-magnitude reduction in the number of features, and thus in dimensionality, was expected to improve predictive performance, both because of the relative contrast theory presented in [19] and because CFS should not be eliminating vital features. Nevertheless, the CFS analysis did provide insight into which features are actually necessary for prediction.
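The two decompositions listed above can be enumerated with a short sketch. The function names and the +1/-1/0 code-matrix encoding (the usual way such decompositions are written in the error correcting output codes framework) are illustrative choices, not taken from the authors' implementation.

```python
from itertools import combinations

CLASSES = ["miRNA", "piRNA", "other ncRNA"]


def all_pairs(classes):
    """AP: one binary problem for every unordered pair of classes."""
    return [((a,), (b,)) for a, b in combinations(classes, 2)]


def one_vs_rest(classes):
    """OVR: one binary problem per class, against all remaining classes."""
    return [((c,), tuple(o for o in classes if o != c)) for c in classes]


def code_matrix(problems, classes):
    """Map each class to its row of codes: +1 positive, -1 negative, 0 unused."""
    matrix = {}
    for c in classes:
        row = []
        for positive, negative in problems:
            if c in positive:
                row.append(1)
            elif c in negative:
                row.append(-1)
            else:
                row.append(0)  # class does not take part in this binary problem
        matrix[c] = row
    return matrix


ap_problems = all_pairs(CLASSES)
ovr_problems = one_vs_rest(CLASSES)

for name, problems in (("AP", ap_problems), ("OVR", ovr_problems)):
    print(name)
    for positive, negative in problems:
        print("  ", " and ".join(positive), "vs.", " and ".join(negative))
    for cls, row in code_matrix(problems, CLASSES).items():
        print("  ", cls.ljust(12), row)
```

With three classes both schemes happen to need three binary classifiers; the difference is that each AP classifier ignores one class entirely (the zero entries), whereas every OVR classifier is trained on all classes.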

