Accurate Ab Initio and Template-Based Prediction of Short Intrinsically-Disordered Regions by Bidirectional Recurrent Neural Networks Trained on Large-Scale Datasets.

Volpato V, Alshomrani B, Pollastri G - Int J Mol Sci (2015)

Bottom Line: The system exploits sequence and structural information in the forms of frequency profiles, predicted secondary structure and solvent accessibility and direct disorder annotations from homologous protein structures (templates) deposited in the Protein Data Bank. The contributions of sequence, structure and homology information result in large improvements in predictive accuracy. Additionally, the large scale of the training set leads to low false positive rates, making our systems a robust and efficient way to address high-throughput disorder prediction.


Affiliation: School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland. volpato.viola@gmail.com.

ABSTRACT
Intrinsically-disordered regions lack a well-defined 3D structure, but play key roles in determining the function of many proteins. Although predictors of disorder have been shown to achieve relatively high rates of correct classification of these segments, improvements over the years have been slow. Accurate methods are needed that can accommodate the ever-increasing number of structurally-determined protein sequences in order to boost predictive performance. In this paper, we propose a predictor for short disordered regions based on bidirectional recurrent neural networks and tested by rigorous five-fold cross-validation on a large, non-redundant dataset collected from MobiDB, a new comprehensive source of protein disorder annotations. The system exploits sequence and structural information in the form of frequency profiles, predicted secondary structure and solvent accessibility, and direct disorder annotations from homologous protein structures (templates) deposited in the Protein Data Bank. The contributions of sequence, structure and homology information result in large improvements in predictive accuracy. Additionally, the large scale of the training set leads to low false positive rates, making our systems a robust and efficient way to address high-throughput disorder prediction.



Figure 4. Top: distribution of the area under the ROC curve (AUC) as a function of sequence identity of the examples to the best template that could be found. Black bars represent template-based results (MSA-Templ) and white bars sequence-based results (MSA). Bottom: the number of proteins with templates within given ranges of sequence identity. Black bins represent average sequence identity; grey bins, identity to the best template.

Mentions: To further explore the relationship between performance and structural information from templates, Figure 4 plots the AUC as a histogram over the sequence identity between the query and its best template, and also reports the number of proteins in each class of sequence identity to the best template. To measure the effect of sequence identity to the best template (that is, how much template quality affects the performance of the system), we subdivide the results into bins of 10% sequence identity. Two sets of binned results are computed: results including homology (MSA-Templ), reported as black bars, and sequence-based results (MSA), reported as grey bars. As in the case of protein function, thresholds for transferring disorder annotation tend to be higher than those for transferring structure. The MSA-Templ results are clearly superior to the sequence-based MSA results when sequence identity to the best template is above 50%, and only marginally so in the 30% to 50% range. The distribution of templates is also shown in Figure 4. Overall, template-based results are more accurate than sequence-based ones in approximately two thirds of the cases in our experiments.
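The binning described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the input layout (one tuple of best-template identity, per-residue disorder labels and per-residue predicted scores for each protein) and the choice to pool residues within a bin before computing the AUC are all assumptions made for illustration. The text does not say whether AUC was pooled or averaged per protein.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula.

    labels are 0 (ordered) or 1 (disordered); tied scores get average ranks.
    Returns None when only one class is present, since AUC is undefined.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bin_index(identity_pct, width=10):
    """Map a sequence identity in [0, 100] to a bin; 100% falls in the top bin."""
    return min(int(identity_pct // width), 100 // width - 1)


def auc_by_identity_bin(proteins, width=10):
    """Pool residues by identity bin (an assumption) and compute AUC per bin.

    proteins: iterable of (identity_pct, labels, scores), one entry per protein.
    Returns {(low, high): auc} keyed by the bin's identity range in percent.
    """
    pooled = defaultdict(lambda: ([], []))
    for identity_pct, labels, scores in proteins:
        b = bin_index(identity_pct, width)
        pooled[b][0].extend(labels)
        pooled[b][1].extend(scores)
    return {
        (b * width, (b + 1) * width): auc(labels, scores)
        for b, (labels, scores) in sorted(pooled.items())
    }


# Toy data only: two hypothetical proteins, not values from the paper.
toy = [
    (35.0, [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    (72.0, [0, 1, 1, 0], [0.2, 0.9, 0.7, 0.1]),
]
binned = auc_by_identity_bin(toy)
```

Running the same function once on template-based scores and once on sequence-only scores would give the two series compared in the figure, under the pooling assumption stated above.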

