Free kick instead of cross-validation in maximum-likelihood refinement of macromolecular crystal structures.

Pražnikar J, Turk D - Acta Crystallogr. D Biol. Crystallogr. (2014)

Bottom Line: Current ML functions utilize phase-error estimates that are calculated from a small fraction of diffraction data, called the test set, that are not used to fit the model. The approach presented here is called ML free-kick refinement as it uses the ML formulation of the target function and is based on the idea of freeing the model from the model bias imposed by the chemical energy restraints used in refinement. This approach for the calculation of error estimates is superior to the cross-validation approach: it reduces the phase error and increases the accuracy of molecular models, is more robust, provides clearer maps and may use a smaller portion of data for the test set for the calculation of Rfree or may leave it out completely.


Affiliation: Department of Biochemistry and Molecular and Structural Biology, Institute Jožef Stefan, Jamova 39, 1000 Ljubljana, Slovenia.

ABSTRACT
The refinement of a molecular model is a computational procedure by which the atomic model is fitted to the diffraction data. The commonly used target in the refinement of macromolecular structures is the maximum-likelihood (ML) function, which relies on the assessment of model errors. The current ML functions rely on cross-validation. They utilize phase-error estimates that are calculated from a small fraction of diffraction data, called the test set, that are not used to fit the model. An approach has been developed that uses the work set to calculate the phase-error estimates in the ML refinement from simulating the model errors via the random displacement of atomic coordinates. It is called ML free-kick refinement as it uses the ML formulation of the target function and is based on the idea of freeing the model from the model bias imposed by the chemical energy restraints used in refinement. This approach for the calculation of error estimates is superior to the cross-validation approach: it reduces the phase error and increases the accuracy of molecular models, is more robust, provides clearer maps and may use a smaller portion of data for the test set for the calculation of Rfree or may leave it out completely.
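
The random displacement of atomic coordinates mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the displacement distribution or magnitude, so the uniform per-axis shift, the `max_shift` value, and the helper names `kick_coordinates` and `rms_displacement` are all assumptions made for illustration.

```python
import math
import random


def kick_coordinates(coords, max_shift, rng):
    """Return a copy of coords with every atom randomly displaced.

    Assumption: each x, y, z component is shifted by a value drawn
    uniformly from [-max_shift, max_shift]. The source only says the
    displacement is random.
    """
    return [
        tuple(x + rng.uniform(-max_shift, max_shift) for x in atom)
        for atom in coords
    ]


def rms_displacement(a, b):
    """Root-mean-square distance between matching atoms of two models."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


# Hypothetical three-atom model, coordinates in angstroms.
rng = random.Random(0)
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
kicked = kick_coordinates(model, max_shift=0.2, rng=rng)
print("r.m.s. displacement:", rms_displacement(model, kicked))
```

In this sketch the kicked model only stands in for a model with simulated errors. How those simulated errors are turned into phase-error estimates for the ML target is not described in the abstract and is not attempted here.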

Figure 6: R factors against phase errors for ML CV. The cases used in the columns from left to right are cherry allergen (PDB entry 2ahn), stefin B tetramer (2oct), cathepsin H (8pch), ammodytin L (3dih) and choline acetyltransferase (2fy2). The test-set sizes are 1, 2, 5 and 10%, indicated by T1, T2, T5 and T10, respectively. Red dashed lines show the starting phase error. Rwork, Rfree and Rfree–Rwork are plotted as dots for each of the 31 cases on the vertical axes in blue, green and orange, respectively.

Mentions: The Rfree value distribution (Fig. 5) appears to be related to the size of the test-set portion, irrespective of the ML approach used: it is tightest at 10% of the data and widest at 1%. This indicates that a larger test set stabilizes the calculation of Rfree and makes it less prone to the accidental choice of the data in the test set. However, increasing the test-set size has an undesirable consequence: it shrinks the work set, which lowers the number of reflections used in refinement and thereby increases the phase error. A similar relationship holds for the Rfree–Rwork difference (Fig. 5). In contrast, the Rwork distributions (Fig. 4) show less variability for the ML FK approach than for the ML CV approach. The wider distribution of Rwork for the ML CV approach is a consequence of the small amount of data in the test set (small sample size), which makes this approach more sensitive to the accidental choice of those data. The opposite is true for the ML FK approach, where most of the data used to calculate the α and β coefficients of the ML function are the same. This indicates that the data in the test set are too few to enable robust and convergent refinement at all stages of structure determination. In particular, the ML CV approach increases the spread of possible solutions at large phase errors. This relationship is also reflected in the trends seen when comparing the phase errors (Fig. 4) and Rwork (Fig. 6): the decrease in the Rwork values in the ML FK approach reflects the decrease in the size of the work set, while this trend is not pronounced in the Rwork plots for the ML CV approach. Thus, Rwork can be lowered by fitting a smaller number of reflections, but using fewer reflections in refinement also results in models that deviate more from their true target.
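
The dependence of the Rfree spread on test-set size can be illustrated with a small sketch. It uses the standard crystallographic R factor, R = Σ||Fobs| − |Fcalc|| / Σ|Fobs|, and repeatedly draws random test sets of 1, 2, 5 and 10% from synthetic amplitudes. No refinement is performed and the amplitudes are invented, so this shows only the sampling variability of an R value computed on a small subset, not the results reported in the paper. The helper names `r_factor`, `split_work_test` and `rfree_spread` are hypothetical.

```python
import random


def r_factor(f_obs, f_calc):
    """Standard crystallographic R factor: sum||Fo| - |Fc|| / sum|Fo|."""
    num = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den


def split_work_test(n_reflections, test_fraction, rng):
    """Randomly choose test-set indices; the remainder is the work set."""
    n_test = max(1, round(n_reflections * test_fraction))
    test = set(rng.sample(range(n_reflections), n_test))
    work = [i for i in range(n_reflections) if i not in test]
    return work, sorted(test)


def rfree_spread(f_obs, f_calc, test_fraction, n_trials, seed=0):
    """Range (max - min) of R on the test set over random test-set choices."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_trials):
        _, test = split_work_test(len(f_obs), test_fraction, rng)
        values.append(
            r_factor([f_obs[i] for i in test], [f_calc[i] for i in test])
        )
    return max(values) - min(values)


# Synthetic amplitudes (assumption): calculated values are the observed
# values perturbed by 20% Gaussian noise.
rng = random.Random(1)
f_obs = [rng.uniform(10.0, 100.0) for _ in range(2000)]
f_calc = [f * (1.0 + rng.gauss(0.0, 0.2)) for f in f_obs]

for fraction in (0.01, 0.02, 0.05, 0.10):
    spread = rfree_spread(f_obs, f_calc, fraction, n_trials=31)
    print(f"test set {fraction:.0%}: spread of R on test set = {spread:.4f}")
```

The 31 trials per test-set size mirror the 31 cases plotted in the figures. Everything else about the setup is illustrative.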

