Automatic Selection of Order Parameters in the Analysis of Large Scale Molecular Dynamics Simulations.

Sultan MM, Kiss G, Shukla D, Pande VS - J Chem Theory Comput (2014)

Bottom Line: We address this challenge by introducing a method called clustering based feature selection (CB-FS) that employs a posterior analysis approach. It combines supervised machine learning (SML) and feature selection with Markov state models to automatically identify the relevant degrees of freedom that separate conformational states. We highlight the utility of the method in the evaluation of large-scale simulations and show that it can be used for the rapid and automated identification of relevant order parameters involved in the functional transitions of two exemplary cell-signaling proteins central to human disease states.


Affiliation: Department of Chemistry, Stanford University, 318 Campus Drive, Stanford, California 94305, United States.

ABSTRACT

Given the large number of crystal structures and NMR ensembles that have been solved to date, classical molecular dynamics (MD) simulations have become powerful tools in the atomistic study of the kinetics and thermodynamics of biomolecular systems on ever increasing time scales. By virtue of the high-dimensional conformational state space that is explored, the interpretation of large-scale simulations faces difficulties not unlike those in the big data community. We address this challenge by introducing a method called clustering based feature selection (CB-FS) that employs a posterior analysis approach. It combines supervised machine learning (SML) and feature selection with Markov state models to automatically identify the relevant degrees of freedom that separate conformational states. We highlight the utility of the method in the evaluation of large-scale simulations and show that it can be used for the rapid and automated identification of relevant order parameters involved in the functional transitions of two exemplary cell-signaling proteins central to human disease states.



Figure 2: A) Terminally capped alanine dipeptide with the two central dihedrals marked. B) ϕ vs ψ scatter plot of the Markov macrostate assignment, based on the heavy atom RMSD as the clustering criterion. C) The Gini importance of the features with the two dihedrals marked. Features 2 to 9 are Gaussian and uniform in noise. D) Recreation of plot B) by training a single classifier on the system dihedrals and employing the classifier to predict the Markov states of an unseen test data set. The contour lines going from white to black represent the decision boundaries learned on the data set. The plot contains half the data points of B) since the first half was used to train the model. Protein images were generated using Visual Molecular Dynamics (VMD).27 The code used to generate these plots can be found online at https://github.com/msultan/alanine_tree.

Mentions: Recent work by Schwantes et al.23 and Pérez-Hernández et al.24 has used time-structure independent components analysis (tICA) to find the slowest decorrelating degrees of freedom in the automatic construction of MSMs. In principle, we can perform similar order parameter selection by looking at the degrees of freedom that are the most correlated with the eigenvectors of the transition matrix or the independent components. Analyses of this nature are limited to the dynamic eigenvectors of the Markov models, and it is not immediately clear how this approach can be extended to arbitrary combinations of states. Clustering based feature selection has the ability to work outside the Markov framework with arbitrary clustering methodologies. Insights about the partitioning can be drawn from the tree structure itself (Figure 1 and Figure 2d). Moreover, calculating the correlation of a single feature against dominant eigenvectors or independent components would suffer from an inability to model higher-order additive effects, much like calculating a single mutual information data point. The toy example in Figure 1 highlights the difficulties that arise from a symmetric interaction effect. A decision tree (DT) can readily model the additive effects to find the relevant features in the data. Projecting the resulting decision boundaries back onto the original space gives the state boundaries one would intuitively expect. Correlation-based metrics such as mutual information (MI) and Pearson correlation (r) are incapable of modeling this effect and incorrectly show no association between either of the features and the output state. The details of how to calculate the mutual information and Pearson correlation of a target feature against the output state are given in the SI.
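The symmetric interaction effect described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation (their code is at the GitHub link in the Figure 2 caption), and every name, the synthetic data set, and the binning choice are assumptions made for illustration. Two features jointly determine a binary state through the sign of their product, and a third feature is pure noise. The sketch computes per-feature Pearson r and a binned mutual information estimate, then grows a small greedy Gini-based decision tree on the first half of the data and scores it on the second half.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((v - my) ** 2 for v in ys))
    return cov / (sx * sy)


def mutual_information(xs, ys, bins=8, lo=-1.0, hi=1.0):
    """Plug-in MI estimate (nats) between equal-width-binned x and a discrete label."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, lab in zip(xs, ys):
        b = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        joint[(b, lab)] = joint.get((b, lab), 0) + 1
        px[b] = px.get(b, 0) + 1
        py[lab] = py.get(lab, 0) + 1
    return sum(c / n * math.log(c * n / (px[b] * py[lab]))
               for (b, lab), c in joint.items())


def gini(pos, n):
    """Gini impurity of a binary node holding `pos` positives out of `n` samples."""
    p = pos / n
    return 2.0 * p * (1.0 - p)


def build_tree(X, y, idx, importance, n_total):
    """Greedily grow a tree until leaves are pure, accumulating Gini importance."""
    n, pos = len(idx), sum(y[i] for i in idx)
    if pos == 0 or pos == n:
        return int(pos == n)  # pure leaf: the predicted label
    parent, best = gini(pos, n), None
    for f in range(len(X[0])):
        order = sorted(idx, key=lambda i: X[i][f])
        left_pos = 0
        for k in range(1, n):
            left_pos += y[order[k - 1]]
            a, b = X[order[k - 1]][f], X[order[k]][f]
            if a == b:
                continue  # no threshold fits between equal values
            child = (k * gini(left_pos, k)
                     + (n - k) * gini(pos - left_pos, n - k)) / n
            if best is None or parent - child > best[0]:
                mid = (a + b) / 2.0
                best = (parent - child, f, mid if mid < b else a)
    if best is None:
        return int(2 * pos >= n)  # identical points, mixed labels: majority vote
    gain, f, t = best
    importance[f] += gain * n / n_total  # impurity decrease weighted by node size
    left = [i for i in idx if X[i][f] <= t]
    right = [i for i in idx if X[i][f] > t]
    return (f, t, build_tree(X, y, left, importance, n_total),
            build_tree(X, y, right, importance, n_total))


def predict(node, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


# Synthetic stand-in data: features 0 and 1 interact, feature 2 is pure noise.
random.seed(0)
N = 1000
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(N)]
y = [int(row[0] * row[1] > 0) for row in X]
train, test = list(range(N // 2)), list(range(N // 2, N))

r = [pearson([X[i][f] for i in train], [y[i] for i in train]) for f in range(3)]
mi = [mutual_information([X[i][f] for i in train], [y[i] for i in train])
      for f in range(3)]

importance = [0.0, 0.0, 0.0]
tree = build_tree(X, y, train, importance, len(train))
total = sum(importance)
importance = [v / total for v in importance]  # normalize to sum to 1
accuracy = sum(predict(tree, X[i]) == y[i] for i in test) / len(test)

print("Pearson r per feature:", [round(v, 3) for v in r])
print("MI (nats) per feature:", [round(v, 3) for v in mi])
print("Gini importance:      ", [round(v, 3) for v in importance])
print("Held-out accuracy:    ", round(accuracy, 3))
```

On data of this kind the single-feature statistics should stay close to zero for all three features, while the tree recovers the state on held-out points and assigns most of the importance to the two interacting features. How that importance is divided between the two can be uneven, because the first greedy split on a symmetric problem is essentially arbitrary.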

