De-novo protein function prediction using DNA binding and RNA binding proteins as a test case

ABSTRACT

Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins, and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de-novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA- and RNA-binding proteins that cannot be identified based on homology, and we validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors, was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de-novo function prediction based on identifying function-related biophysical features.



Figure 1: Precision-recall curves (PRC) for predicting nucleic acid (NA) binding residues. (a) PRC for DNA-binding residue prediction. The arrow marks the cutoff score that yielded a precision of 0.65 and a recall of 0.23. (b) PRC for RNA-binding residue prediction.
© Copyright Policy - open-access

One random forest (RF) model was trained to predict DNA-binding residues, and another RF model was trained to predict RNA-binding residues. To train these models, each residue was described by 246 features, which included predicted structural features (secondary structure, solvent accessibility and disorder) as well as sequence features (type of amino acid, sequence neighbors and their characteristics, conservation and entropy; see Methods for details). Figure 1a shows the precision-recall curve (PRC) for predicting DNA-binding residues, and Figure 1b presents the PRC for predicting RNA-binding residues. The light-grey line represents the expected performance of a random guess. The PRC shows a tradeoff between the two measures: high precision at low recall, and vice versa. For example, in Fig. 1a an arrow points to a cutoff score that yielded a precision of 0.65 (i.e., at this cutoff ∼65% of the residues predicted to bind DNA are indeed observed experimentally to bind DNA) and a recall of 0.23 (i.e., at this cutoff ∼23% of the residues observed experimentally to bind DNA were identified as such by the model). One can choose any point along this curve as the operating point for the prediction method.
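To make the two measures concrete, the following is a minimal sketch of how precision and recall at a chosen cutoff could be computed from per-residue scores. It is an illustration only, not the authors' pipeline: the function names (`precision_recall_at_cutoff`, `pr_curve`) and the toy scores and labels are hypothetical, and no RF model is trained here.

```python
from typing import List, Sequence, Tuple


def precision_recall_at_cutoff(
    scores: Sequence[float], labels: Sequence[int], cutoff: float
) -> Tuple[float, float]:
    """Return (precision, recall) when residues with score >= cutoff
    are predicted as binding. labels: 1 = observed binding, 0 = not."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_binding = score >= cutoff
        if predicted_binding and label:
            tp += 1  # predicted binding, observed binding
        elif predicted_binding and not label:
            fp += 1  # predicted binding, not observed
        elif label:
            fn += 1  # observed binding, missed by the prediction
    # Convention (an assumption): precision is 1.0 when nothing is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_curve(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float, float]]:
    """Return (cutoff, precision, recall) for every distinct score,
    from the strictest cutoff to the most permissive."""
    cutoffs = sorted(set(scores), reverse=True)
    return [(c, *precision_recall_at_cutoff(scores, labels, c)) for c in cutoffs]


# Hypothetical toy data: eight residues, four of which bind.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

for cutoff, precision, recall in pr_curve(toy_scores, toy_labels):
    print(f"cutoff={cutoff:.1f}  precision={precision:.2f}  recall={recall:.2f}")

# Expected precision of a random guess: the fraction of binding residues.
random_baseline = sum(toy_labels) / len(toy_labels)
```

On the toy data, a strict cutoff of 0.75 gives high precision at low recall, while lowering the cutoff raises recall at the cost of precision, which is the tradeoff described above. Picking an operating point amounts to picking one row of the `pr_curve` output.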

