Inferring stabilizing mutations from protein phylogenies: application to influenza hemagglutinin.
Bottom Line:
We mathematically formalize this framework to develop a Bayesian approach for inferring the stability effects of individual mutations from homologous protein sequences of known phylogeny. This approach is able to predict published experimentally measured mutational stability effects (ΔΔG values) with an accuracy that exceeds both a state-of-the-art physicochemical modeling program and the sequence-based consensus approach. Our work therefore describes a powerful new approach for predicting stabilizing mutations that can be successfully applied even to large, complex proteins such as hemagglutinin.
Affiliation: Division of Biology, California Institute of Technology, Pasadena, California, USA. jesse.bloom@gmail.com
ABSTRACT
One selection pressure shaping sequence evolution is the requirement that a protein fold with sufficient stability to perform its biological functions. We present a conceptual framework that explains how this requirement causes the probability that a particular amino acid mutation is fixed during evolution to depend on its effect on protein stability. We mathematically formalize this framework to develop a Bayesian approach for inferring the stability effects of individual mutations from homologous protein sequences of known phylogeny. This approach is able to predict published experimentally measured mutational stability effects (ΔΔG values) with an accuracy that exceeds both a state-of-the-art physicochemical modeling program and the sequence-based consensus approach. As a further test, we use our phylogenetic inference approach to predict stabilizing mutations to influenza hemagglutinin. We introduce these mutations into a temperature-sensitive influenza virus with a defect in its hemagglutinin gene and experimentally demonstrate that some of the mutations allow the virus to grow at higher temperatures. Our work therefore describes a powerful new approach for predicting stabilizing mutations that can be successfully applied even to large, complex proteins such as hemagglutinin. This approach also makes a mathematical link between phylogenetics and experimentally measurable protein properties, potentially paving the way for more accurate analyses of molecular evolution.
With this mean-field approximation, the issue becomes determining the average distribution of stabilities in an evolving population of proteins. This problem has been treated previously by simulations [29] and mathematically through matrix [31] and diffusion [32] equation approaches. The average distribution of stabilities turns out to depend on the degree of polymorphism in the population: highly polymorphic populations (those in which the product of the population size and the per-sequence, per-generation mutation rate is much greater than one) evolve to greater average stabilities than populations that are mostly monomorphic (those in which this product is much less than one) [31],[67],[68]. Here we will consider only the case where the population is mostly monomorphic, so that all proteins tend to have converged to the same stability before a new mutation occurs (as is the case for the proteins shown in Figure 1). This choice is dictated by the fact that it is unclear to us how to incorporate the secondary selection for mutational robustness that occurs in highly polymorphic populations [31],[67],[68]. We acknowledge that some of the proteins that we analyze later in this paper (particularly influenza hemagglutinin) may actually evolve in populations that are highly polymorphic, and suggest that a mathematical treatment recognizing this fact is an area for future research. Given our choice to consider only the mostly monomorphic case, we adopt the mathematical formalism described in [31] for the monomorphic limit (the more compact diffusion-equation approach of Shakhnovich and coworkers [32] cannot be used, since it applies only to highly polymorphic populations). Following [31], we discretize the continuous variable of extra protein stability into small bins of fixed width and assign each protein to the bin whose stability range contains its extra stability. The bins are indexed by integers up to some large integer that sets an upper limit on the number of stability bins, so that all proteins in the evolving population fall within the binned range.
Note that all folded proteins fall into one of these bins, since proteins whose stability falls short of the threshold fail to fold under the stability threshold model. Reference [31] finds that the distribution of average protein stabilities is well approximated by an exponential (see the middle panels of Figure 2 of this reference, or alternatively Figure 2A of [29]), so that the probability that a protein in the evolving population has extra stability falling in a given bin decays exponentially with that bin's extra stability (Equation 3), with a constant describing the steepness of the exponential. Figure 2 shows this distribution of protein stabilities graphically. Note that this exact mathematical form for the distribution is not proven in [31]; rather, all numerical solutions there give distributions that resemble this form. Other mathematical forms could be chosen for the distribution without altering the mathematical analysis that follows, although they might affect the actual numerical values that are ultimately inferred for the stability-effect (ΔΔG) values. In particular, in highly polymorphic populations, the distribution of stabilities is peaked at a value slightly below the stability threshold (see right panels of Figure 2 of [31], Figure 2 of [32], or Figure 2B of [29]) rather than being an exponential. However, any distribution in which highly stable proteins are rare and marginally stable proteins are common should lead to qualitatively similar inferred values, since the subsequent analysis only employs the cumulative distribution function of the stability distribution in a rather coarse manner. Given the definition of the distribution in Equation 3, the exact numerical value of the steepness constant simply sets a scale for the inferred values (in conjunction with the bin size, it determines their units). As is described later in this paper, in our actual computational implementation, we chose a value for this constant that placed the magnitude of the inferred values in the same dynamic range as the informative priors.
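The binned exponential distribution described above can be illustrated with a minimal sketch. The symbols of Equation 3 did not survive in this copy of the text, so everything here is an assumption made for illustration: the names `alpha` (steepness), `bin_width` and `n_bins`, the form of the probability (proportional to exp(-alpha * bin_width * i) for bin i, normalized over the bins), and the convention that extra stability is a non-negative quantity with bin i covering the range from (i - 1) * bin_width up to i * bin_width. It is not the authors' implementation.

```python
import math


def stability_bin_probabilities(alpha, bin_width, n_bins):
    """Exponential distribution over stability bins 1..n_bins.

    Assumed form (not taken from the paper): the weight of bin i is
    exp(-alpha * bin_width * i), normalized so the weights sum to one.
    """
    weights = [math.exp(-alpha * bin_width * i) for i in range(1, n_bins + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def cumulative(probs):
    """Running sums of the bin probabilities (a discrete cumulative distribution)."""
    running = 0.0
    result = []
    for p in probs:
        running += p
        result.append(running)
    return result


def assign_bin(extra_stability, bin_width, n_bins):
    """Return the 1-based bin index for a protein, or None if it has no bin.

    Assumed convention: extra stability is non-negative for folded proteins,
    and bin i covers [(i - 1) * bin_width, i * bin_width).
    """
    if extra_stability < 0:
        return None  # below the threshold: does not fold in this model
    index = int(extra_stability // bin_width) + 1
    return index if index <= n_bins else None


if __name__ == "__main__":
    probs = stability_bin_probabilities(alpha=1.0, bin_width=0.1, n_bins=50)
    cdf = cumulative(probs)
    # Marginally stable bins (low index) carry the most probability.
    print(probs[0] > probs[-1], round(cdf[-1], 6))
```

In this sketch, changing `alpha` only rescales how quickly probability falls off across bins, which mirrors the statement that the steepness constant, together with the bin size, sets the scale of the inferred values.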