Sigma-RF: prediction of the variability of spatial restraints in template-based modeling by random forest.

Lee J, Lee K, Joung I, Joo K, Brooks BR, Lee J - BMC Bioinformatics (2015)

Bottom Line: The benchmark results on 22 CASP9 targets show that the variability values from Sigma-RF correlate more strongly with the true distance deviations than those from Modeller. We assessed the effect of the new sigma values by performing single-domain homology modeling of 22 CASP9 targets and 24 CASP10 targets. For most of the targets tested, we obtained more accurate 3D models from identical alignments by using the Sigma-RF results than by using the Modeller ones.


Affiliation: Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, 5635 Fishers Ln, Bethesda, 20852, USA. juyong.lee@nih.gov.

ABSTRACT

Background: In template-based modeling with a single template, inter-atomic distances of an unknown protein structure are assumed to follow Gaussian probability density functions centered at the distances between the corresponding atoms in the template structure. The width of the Gaussian distribution, the variability of a spatial restraint, is closely related to the reliability of the restraint information extracted from a template, and it should be accurately estimated for successful template-based protein structure modeling.
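The link between the Gaussian width and the strength of a restraint can be written out explicitly. As a sketch, with d a model distance, d_t the corresponding template distance and σ the variability (symbol names chosen here for illustration), the negative logarithm of the Gaussian density is a harmonic term whose stiffness scales as 1/σ²:

```latex
p(d) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(d - d_t)^2}{2\sigma^2}\right),
\qquad
-\ln p(d) = \frac{(d - d_t)^2}{2\sigma^2} + \ln\!\left(\sigma\sqrt{2\pi}\right)
```

An underestimated σ therefore penalizes any departure from the template distance steeply, while an overestimated σ makes the restraint nearly uninformative.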

Results: To predict the variability of the spatial restraints in template-based modeling, we have devised a prediction model, Sigma-RF, by using the random forest (RF) algorithm. The benchmark results on 22 CASP9 targets show that the variability values from Sigma-RF correlate more strongly with the true distance deviations than those from Modeller. We assessed the effect of the new sigma values by performing single-domain homology modeling of 22 CASP9 targets and 24 CASP10 targets. For most of the targets tested, we obtained more accurate 3D models from identical alignments by using the Sigma-RF results than by using the Modeller ones.
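The comparison described in the Results can be illustrated with a small sketch that scores a set of predicted σ values against the true distance deviations. It assumes the Pearson coefficient as the correlation measure, which the text above does not specify; the function names and all numbers are hypothetical and chosen only to show the calculation.

```python
import math


def true_deviation(d_native, d_template):
    """Absolute error of a template-derived distance, |d_native - d_template|."""
    return abs(d_native - d_template)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up distances in angstroms, for illustration only (not data from the paper).
d_native = [5.1, 8.0, 12.4, 20.3, 15.0]
d_template = [5.0, 7.2, 10.1, 9.8, 14.1]
errors = [true_deviation(n, t) for n, t in zip(d_native, d_template)]

# Two hypothetical sets of predicted sigma values for the same restraints.
sigma_wide = [0.4, 1.0, 2.5, 6.0, 1.2]    # spreads out with the errors
sigma_narrow = [0.5, 0.6, 0.7, 0.6, 0.5]  # narrowly distributed

print("wide  :", round(pearson(sigma_wide, errors), 3))
print("narrow:", round(pearson(sigma_narrow, errors), 3))
```

A predictor whose σ values grow with the actual errors scores higher here than one whose values stay in a narrow band, which is the kind of difference the benchmark measures.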

Conclusions: By investigating the importance of the input features used in the RF machine learning, we find that the most important factor is quasi-local information: the average alignment quality of the residues located between and at two aligned residues. This average alignment quality is shown to be more important than the previously identified local quantity: the product of the alignment qualities at the two aligned residues.
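The two quantities contrasted in the Conclusions can be sketched as follows. The per-residue alignment quality scores, their scale and the function names are assumptions made for illustration; the text above does not define how alignment quality is computed.

```python
def quasi_local_quality(quality, i, j):
    """Average alignment quality over residues i..j, both ends included."""
    lo, hi = sorted((i, j))
    window = quality[lo:hi + 1]
    return sum(window) / len(window)


def local_quality_product(quality, i, j):
    """Product of the alignment qualities at the two restrained residues only."""
    return quality[i] * quality[j]


# Hypothetical per-residue scores in [0, 1]: both end residues are well aligned,
# but the stretch between them is not.
quality = [0.9, 0.2, 0.1, 0.3, 0.8]

print("local product     :", round(local_quality_product(quality, 0, 4), 2))
print("quasi-local average:", round(quasi_local_quality(quality, 0, 4), 2))
```

In this toy case the local product stays high because it sees only the two end residues, while the quasi-local average drops because it also accounts for the poorly aligned residues between them.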

Figure 1: Predicted distance variability values are shown against actual distance errors for T0552 and T0598. The results for T0552 are shown in panels A and B, and those for T0598 in panels C and D. The variability values from Sigma-RF, σRF (green), show better correlation with the true distance deviations, σnative = |dnative − dtemplate|, than those from Modeller, σModeller (red). The blue lines represent the line y = x.

Mentions: To illustrate the difference between the Sigma-RF and Modeller results in detail, σ values of the Cα-Cα restraints of T0552 and T0598 are shown in Figure 1. We observe that the σModeller values (red) tend to be smaller and more narrowly distributed than the σRF values (green). We note that many highly inaccurate spatial restraints, |dn − dt| > 10.0 Å, are assigned rather small σ values by Modeller, σ < 2. These small σ values can significantly lower the accuracy of the resulting 3D protein models, since the corresponding harmonic restraints assign large penalty scores to the native structure, which would prevent the sampling of more native-like conformations. On the other hand, Sigma-RF provides relatively larger σ values for highly inaccurate distance restraints than Modeller does. For T0552 and T0598, all highly inaccurate restraints are predicted with larger σ values by Sigma-RF. This will lower the penalty from inaccurate distance restraints and will potentially allow one to sample more native-like conformations, which are inaccessible with the small σ values from Modeller.
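The effect of σ on the penalty paid by the native structure can be checked with a few lines of arithmetic. The sketch below assumes the plain harmonic form (d − d_t)² / (2σ²) and leaves out any constants or weights a modeling program may apply; the distances and σ values are hypothetical, picked so that the restraint error exceeds 10 Å.

```python
def harmonic_penalty(d, d_template, sigma):
    """Harmonic restraint penalty (d - d_template)^2 / (2 * sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (d - d_template) ** 2 / (2.0 * sigma ** 2)


# Hypothetical restraint: the template distance is off by 12 angstroms.
d_native = 22.0
d_template = 10.0

# Penalty the native distance receives under a small and under a larger sigma.
small_sigma_penalty = harmonic_penalty(d_native, d_template, sigma=1.5)
large_sigma_penalty = harmonic_penalty(d_native, d_template, sigma=8.0)

print(f"sigma = 1.5: {small_sigma_penalty:.3f}")
print(f"sigma = 8.0: {large_sigma_penalty:.3f}")
```

Because the penalty scales as 1/σ², widening σ from 1.5 to 8.0 reduces the cost of the native distance by a factor of about 28 for this restraint, so a sampler is far less discouraged from visiting native-like conformations.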

