Quantifying sequence and structural features of protein-RNA interactions.
Bottom Line: Several novel and modified features enhanced the accuracy of residue-level RNA-binding propensity beyond what has been reported previously, including by meta-prediction servers. These features include: hidden Markov model-based evolutionary conservation, surface deformations based on the Laplacian norm formalism, and relative solvent accessibility partitioned into backbone and side chain contributions. We constructed a web server called aaRNA that implements the proposed method and demonstrate its use in identifying putative RNA binding sites.
Affiliation: Laboratory of Systems Immunology, WPI Immunology Frontier Research Center, Osaka University, Osaka 565-0871, Japan. firstname.lastname@example.org
Mentions: The EC feature is illustrated in Figure 1A, using the class-I Archaeoglobus fulgidus CCA-adding enzyme bound to a tRNA fragment as an example. We found that for the non-ribosomal and full datasets, the EC feature improved the AUC by ∼1.3% and 0.8%, respectively, and also resulted in a better PR curve (Supplementary Figure S3) than the control method, which used the sequence features of the SRCPred method (12). To quantify the information contained in each feature, we used EC and PSSM separately. For the PSSM, we took the substitution frequency of each residue to itself in the PSSM profile and normalized the frequencies via a logistic operator. We also included the 21-bit sparse coding feature and the GAC feature. The resulting AUCs were 0.7277 and 0.7075, respectively, for the EC- and PSSM-based models on the non-ribosomal dataset, and 0.8046 and 0.7942 on the complete dataset. These values verify that the EC feature contains additional information not found in the conservation values of the PSSM. We tested different E-value thresholds (1E-3, 1E-5 and 1E-10) for building MSA profiles, from which EC values were calculated. Using different E-values, a combination of E-values, or building a PSSM-like substitution matrix with occurrence frequencies for each of the 20 amino acid types did not result in an increase in performance. Therefore, the default E-value threshold was set to 1E-3. It should be noted that, depending on the number of homologous sequences in the database, the weight calculation step could be time-consuming. However, we were able to greatly speed up this process by parallelization. After manually inspecting many known protein–RNA complexes, we could discern a rough correlation between residue conservation and distance to the bound RNA. As shown in Supplementary Figure S4A, the mean distance between protein surface residues and their bound RNAs was inversely related to the EC values.
Moreover, large EC values were more enriched among RNA-binding residues than among non-binding or background residues (Supplementary Figure S4B). However, as expected, conserved residues were not always near RNA binding sites.
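The PSSM-based conservation control described above (each residue's self-substitution entry in the PSSM, passed through a logistic operator) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the column order `AA_ORDER` (the usual PSI-BLAST ASCII layout), the plain logistic form 1/(1+e^-x) without any scaling, and the neutral value used for unknown residues are all assumptions that the text does not specify.

```python
import math

# Assumed column order of each PSSM row (the usual PSI-BLAST ASCII layout).
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"


def logistic(x):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def self_substitution_conservation(sequence, pssm):
    """Return one conservation value per residue.

    sequence: string of one-letter amino acid codes.
    pssm: list of rows, one per residue, each with 20 scores in AA_ORDER.
    For each position, the score of the residue substituting to itself is
    taken from the PSSM row and squashed with the logistic function.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    values = []
    for residue, row in zip(sequence, pssm):
        column = AA_ORDER.find(residue)
        if column < 0:
            # Non-standard residue: fall back to a neutral 0.5 (assumption).
            values.append(0.5)
        else:
            values.append(logistic(row[column]))
    return values


if __name__ == "__main__":
    # Toy two-residue example with made-up scores, for illustration only.
    toy_pssm = [[0] * 20, [0] * 20]
    toy_pssm[0][AA_ORDER.index("A")] = 4   # strongly conserved alanine
    toy_pssm[1][AA_ORDER.index("R")] = -2  # poorly conserved arginine
    print(self_substitution_conservation("AR", toy_pssm))
```

Higher self-substitution scores map to values closer to 1, so the output can be used as a per-residue conservation feature in the same way as the EC values, which is what makes the side-by-side AUC comparison possible.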