Accurate Ab Initio and Template-Based Prediction of Short Intrinsically-Disordered Regions by Bidirectional Recurrent Neural Networks Trained on Large-Scale Datasets.

Volpato V, Alshomrani B, Pollastri G - Int J Mol Sci (2015)

Bottom Line: The system exploits sequence and structural information in the forms of frequency profiles, predicted secondary structure and solvent accessibility and direct disorder annotations from homologous protein structures (templates) deposited in the Protein Data Bank. The contributions of sequence, structure and homology information result in large improvements in predictive accuracy. Additionally, the large scale of the training set leads to low false positive rates, making our systems a robust and efficient way to address high-throughput disorder prediction.


Affiliation: School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland. volpato.viola@gmail.com.

ABSTRACT
Intrinsically-disordered regions lack a well-defined 3D structure, but play key roles in determining the function of many proteins. Although predictors of disorder have been shown to achieve relatively high rates of correct classification of these segments, improvements over the years have been slow, and accurate methods are needed that can accommodate the ever-increasing number of structurally-determined protein sequences in order to boost predictive performance. In this paper, we propose a predictor for short disordered regions based on bidirectional recurrent neural networks and tested by rigorous five-fold cross-validation on a large, non-redundant dataset collected from MobiDB, a new comprehensive source of protein disorder annotations. The system exploits sequence and structural information in the forms of frequency profiles, predicted secondary structure and solvent accessibility and direct disorder annotations from homologous protein structures (templates) deposited in the Protein Data Bank. The contributions of sequence, structure and homology information result in large improvements in predictive accuracy. Additionally, the large scale of the training set leads to low false positive rates, making our systems a robust and efficient way to address high-throughput disorder prediction.




Figure 2 (ijms-16-19868-f002): Precision-recall curves for all of our systems on X-ray test set data.

Mentions: Table 1, Table 2 and Table 3 summarise the results on both training and test sets using different combinations of evolutionary inputs, structural features predicted by Porter (secondary structure) and PaleAle (solvent accessibility), as well as weighted information from disordered templates as input to the BRNNs and to a standard NN containing 20 hidden units and taking as input a window of 21 residues. The number of false positives is kept low, to avoid drawing false conclusions about the prevalence of disorder, without compromising the predictors’ sensitivity. This is shown in Figure 1 and Figure 2, and especially in Figure 3, which contains a typical ROC curve plot on the test set focused on low FPRs. A stringent 5% expected FPR threshold, selected on the training dataset for our best predictor (MSA-SS-SA-Templ), yields 0.751 sensitivity and 0.462 precision on the test set at an observed FPR of 5.5% (black line on the plot). The very high specificity of our systems makes them potentially useful in high-throughput analysis of protein disorder, where the main goal is to create predictors producing a minimum number of false positives [47].
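
The threshold procedure described above (pick a score cutoff on the training set for a target FPR, then measure sensitivity, precision and FPR on the test set) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `threshold_for_fpr` and `evaluate`, the convention that a residue is called disordered when its score is strictly above the cutoff, and the toy scores are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def threshold_for_fpr(neg_scores: Sequence[float], target_fpr: float) -> float:
    """Pick a cutoff on training data so that at most `target_fpr` of the
    ordered (negative) residues score strictly above it."""
    if not neg_scores:
        raise ValueError("need at least one negative training score")
    ranked = sorted(neg_scores, reverse=True)
    # Number of false positives we are willing to tolerate on the training set.
    allowed = int(target_fpr * len(ranked))
    # Only the `allowed` highest-scoring negatives can lie strictly above this value.
    return ranked[min(allowed, len(ranked) - 1)]


def evaluate(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float, float]:
    """Return (sensitivity, precision, FPR) for per-residue scores.

    Labels: 1 = disordered, 0 = ordered. A residue is predicted
    disordered when its score is strictly above `threshold`.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score > threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return sensitivity, precision, fpr


if __name__ == "__main__":
    # Toy data (hypothetical): scores of ordered residues in the training set.
    train_neg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    cutoff = threshold_for_fpr(train_neg, target_fpr=0.25)

    # Toy test set (hypothetical): the cutoff is applied unchanged.
    test_scores = [0.9, 0.65, 0.5, 0.7, 0.3, 0.2, 0.1, 0.4]
    test_labels = [1, 1, 1, 0, 0, 0, 0, 0]
    print(cutoff, evaluate(test_scores, test_labels, cutoff))
```

Because the cutoff is fixed on the training data, the FPR observed on the test set need not equal the target exactly, which is consistent with the 5% expected versus 5.5% observed FPR reported above.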

