Quantifying sequence and structural features of protein-RNA interactions.
Bottom Line: Several novel and modified features enhanced the accuracy of residue-level RNA-binding propensity beyond what has been reported previously, including by meta-prediction servers. These features include hidden Markov model-based evolutionary conservation, surface deformations based on the Laplacian norm formalism, and relative solvent accessibility partitioned into backbone and side chain contributions. We constructed a web server called aaRNA that implements the proposed method and demonstrated its use in identifying putative RNA binding sites.
Affiliation: Laboratory of Systems Immunology, WPI Immunology Frontier Research Center, Osaka University, Osaka 565-0871, Japan firstname.lastname@example.org.
Mentions: After combining the sequence features described above (including the terms used in the control method) with the structural features, we compared the performance of our model with that of the sequence-based control method alone. The number of columns for each kind of feature and the size of the fragment window used (either sequential or spatial) are summarized in Table 1. A performance summary for individual features and for all features combined can be found in Table 2. In total, each coding fragment window was represented by a 668-column feature vector. We found that our novel feature-coding scheme significantly increased prediction performance, in terms of both the AUC and the PR measurements, for both datasets, as shown in Figure 2 (ROC and PR curves for binary prediction on the non-ribosomal and full sets) and Supplementary Figures S10 and S11 (di-nucleotide curves for the non-ribosomal and full sets). Importantly, contributions from different features were approximately additive, resulting in AUC increases of 6.8% and 4.1% for the non-ribosomal and full sets, respectively, which indicates low redundancy between features. The new features that contributed the most information were the Laplacian norm (LN) and the normalized accessible surface area (ASA). When the evolutionary conservation (EC) feature was used instead of the PSSM feature, performance decreased only slightly (data not shown). This is a non-trivial result because the PSSM feature requires 20 columns per residue whereas the EC feature requires just one. The novel features described here are therefore both information-rich and relatively compact. The PR plots show that, when training with the non-ribosomal dataset, the highest precision approached only 0.7, even at extremely low sensitivity.
In contrast, with the full dataset, which contains twice the number of RNA binding sites, precision approached 1.0 at low sensitivity, and the sensitivity values estimated at a precision of 0.7 approached 1 (t-test P-value < 1e-12). The neural network could therefore learn from ribosomal proteins in the full dataset and achieve better predictions even on non-ribosomal complexes. This implies that, although ribosomal and non-ribosomal proteins may differ in their RNA binding modes, they apparently share common features as well. We considered as 'true' only those RNA-binding residues within a 3.5 Å distance cutoff in the same BU; consequently, non-binding residues made up 89.4% of our RB205 dataset. This ratio was ∼85% when using a 5 Å distance cutoff for RNA-binding residues. With the best performing model on the RB205 dataset (sensitivity 0.763, specificity 0.787, average of the two 0.775), 72.9% of residues were predicted to be non-binding.
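The figures quoted for the best performing model on RB205 can be checked against each other with simple confusion-matrix arithmetic. The following is a minimal sketch, not part of the aaRNA method: the function name `operating_point` and its dictionary keys are hypothetical, and it assumes that the reported sensitivity and specificity apply to a dataset in which 89.4% of residues are non-binding. It reproduces the average of sensitivity and specificity and the predicted non-binding fraction, and it also reports the precision implied by those rates, which is a derived quantity not stated in the text.

```python
def operating_point(sensitivity, specificity, negative_fraction):
    """Confusion-matrix fractions implied by the given rates.

    All arguments are fractions in [0, 1]; negative_fraction is the share
    of non-binding residues in the dataset (an assumption of this sketch).
    """
    positive_fraction = 1.0 - negative_fraction
    tp = sensitivity * positive_fraction          # binding, predicted binding
    fn = (1.0 - sensitivity) * positive_fraction  # binding, predicted non-binding
    tn = specificity * negative_fraction          # non-binding, predicted non-binding
    fp = (1.0 - specificity) * negative_fraction  # non-binding, predicted binding
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
        "predicted_negative_fraction": tn + fn,
        "precision": tp / (tp + fp),
    }


# Values quoted in the text for the RB205 dataset.
result = operating_point(sensitivity=0.763, specificity=0.787,
                         negative_fraction=0.894)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the predicted non-binding fraction comes out at about 0.729, matching the 72.9% quoted above. The implied precision is far lower than the sensitivity, which illustrates why PR curves are reported alongside ROC curves for a dataset this imbalanced.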