Prediction of MicroRNA Precursors Using Parsimonious Feature Sets.

Stepanowsky P, Levy E, Kim J, Jiang X, Ohno-Machado L - Cancer Inform (2014)

Bottom Line: However, no study has systematically compared published feature sets. We found that classifiers containing as few as seven highly predictive features are able to predict novel precursor miRNAs as well as classifiers that use larger feature sets. In a real data set, our method correctly identified the holdout miRNAs relevant to renal cancer.

Affiliation: Bioinformatics Research Group, University of Applied Sciences, Upper Austria, Hagenberg, Austria.

ABSTRACT
MicroRNAs (miRNAs) are a class of short noncoding RNAs that regulate gene expression through base pairing with messenger RNAs. Due to the interest in studying miRNA dysregulation in disease and limits of validated miRNA references, identification of novel miRNAs is a critical task. The performance of different models to predict novel miRNAs varies with the features chosen as predictors. However, no study has systematically compared published feature sets. We constructed a comprehensive feature set using the minimum free energy of the secondary structure of precursor miRNAs, a set of nucleotide-structure triplets, and additional extracted sequence and structure characteristics. We then compared the predictive value of our comprehensive feature set to those from three previously published studies, using logistic regression and random forest classifiers. We found that classifiers containing as few as seven highly predictive features are able to predict novel precursor miRNAs as well as classifiers that use larger feature sets. In a real data set, our method correctly identified the holdout miRNAs relevant to renal cancer.
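The abstract does not spell out how the nucleotide-structure triplets are encoded. As a rough illustration only, the sketch below assumes a common scheme: for each position, the paired/unpaired state of three adjacent bases is combined with the middle nucleotide, giving 4 x 8 = 32 frequency features. It also assumes the secondary structure is already available as a dot-bracket string from a folding tool, and it scans the whole hairpin rather than selected stem regions. The function name `triplet_features` is hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def triplet_features(seq, structure):
    """Return 32 nucleotide-structure triplet frequencies (assumed encoding).

    seq:       RNA sequence over A/C/G/U (T is treated as U)
    structure: dot-bracket string of the same length
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")

    # Collapse '(' and ')' into one "paired" symbol; '.' means unpaired.
    paired = ["(" if c in "()" else "." for c in structure]

    # All 32 keys, e.g. "A(((", "A((.", ..., "U..."
    keys = [n + "".join(s) for n in NUCLEOTIDES for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)

    # Slide a window of three positions; the middle base names the triplet.
    for i in range(1, len(seq) - 1):
        base = seq[i].upper().replace("T", "U")
        if base not in NUCLEOTIDES:
            continue  # skip ambiguous bases such as N
        counts[base + paired[i - 1] + paired[i] + paired[i + 1]] += 1

    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy hairpin, not taken from the study's data.
features = triplet_features("GGGAAACCC", "(((...)))")
```

Each of the seven interior positions of the toy hairpin yields a distinct triplet, so seven of the 32 features are non-zero. In a fuller pipeline these frequencies would sit alongside the minimum free energy and the other sequence and structure characteristics.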

f6-cin-suppl.1-2014-095: AUC values versus number of features sequentially added based on RELIEF score for RF classifiers in a 10-fold cross-validation.

Mentions: The performance of the two classifier models under 10-fold cross-validation is shown in Table 3 and Figure 4. For classifier evaluation, we used the area under the ROC curve (AUC), a well-known aggregate measure of discrimination that does not commit to a particular threshold [32]. Our combined feature set shows consistently higher AUC values than the feature set of Xue et al. [15] and AUC values comparable to those of Jiang et al. [16] and Zhao et al. [17] for both classifier types (LR and RF). We also held out 20% of the data for external validation. The AUC values obtained on this unseen data were slightly higher (Table 4) than those on the training set, indicating a low likelihood of over-fitting. Figure 4 shows the estimated AUC values for each feature set applied to the LR and RF models. Given the similar discrimination of the last three methods (Jiang, Zhao, and the proposed method), we sought to identify the features common to all three that are most critical for prediction. Based on the RELIEF score, we selected the top seven common features and trained LR and RF classifiers on them. These “minimal” classifiers reached the same performance level as those with more than seven features, with AUC values of 0.9824 and 0.9796 for LR and RF, respectively, on the validation set. In addition, we confirmed the robustness of the RELIEF selections by sequentially adding features to an RF classifier in order of RELIEF score (Fig. 6).
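The two quantities this paragraph relies on, the RELIEF score and the AUC, can be sketched in a few lines. The code below is a minimal illustration and not the authors' implementation: it assumes the basic Relief variant (one nearest hit and one nearest miss per instance, every instance visited once, range-normalised Manhattan distance) and computes the AUC as the fraction of positive-negative pairs ranked correctly. The toy data and the names `relief_scores` and `auc` are hypothetical, and training the LR and RF classifiers is not reproduced here.

```python
def relief_scores(X, y):
    """Basic Relief weights for numeric features and binary labels (assumed variant)."""
    n, d = len(X), len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]  # avoid division by zero

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        # Reward features that separate the classes, penalise those that
        # differ between neighbours of the same class.
        for j in range(d):
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return weights


def auc(scores, labels):
    """Threshold-free AUC: share of (positive, negative) pairs ordered correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9],
     [0.90, 0.4], [0.80, 0.8], [0.85, 0.2]]
y = [0, 0, 0, 1, 1, 1]

weights = relief_scores(X, y)
ranking = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
```

In the sequential-addition experiment described above, the first k entries of such a ranking would be passed to the RF classifier for k = 1, 2, and so on, with the cross-validated AUC recorded at each step.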

