Estimates of the effect of natural selection on protein-coding content.

Yap VB, Lindsay H, Easteal S, Huttley G - Mol. Biol. Evol. (2009)

Bottom Line: For protein-coding nucleotide sequences, the ratio of nonsynonymous to synonymous substitutions (omega) distinguishes neutrally evolving sequences (omega = 1) from those subjected to purifying (omega < 1) or positive Darwinian (omega > 1) selection. We show that current models used to estimate omega are substantially biased by naturally occurring sequence compositions. Our work has substantial implications for how vaccine targets are chosen and for studying the molecular basis of adaptive evolution.


Affiliation: Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore. stayapvb@nus.edu.sg

ABSTRACT
Analysis of natural selection is key to understanding many core biological processes, including the emergence of competition, cooperation, and complexity, and has important applications in the targeted development of vaccines. Selection is hard to observe directly but can be inferred from molecular sequence variation. For protein-coding nucleotide sequences, the ratio of nonsynonymous to synonymous substitutions (omega) distinguishes neutrally evolving sequences (omega = 1) from those subjected to purifying (omega < 1) or positive Darwinian (omega > 1) selection. We show that current models used to estimate omega are substantially biased by naturally occurring sequence compositions. We present a novel model that weights substitutions by conditional nucleotide frequencies and which escapes these artifacts. Applying it to the genomes of pathogens causing malaria, leprosy, tuberculosis, and Lyme disease gave significant discrepancies in estimates with approximately 10-30% of genes affected. Our work has substantial implications for how vaccine targets are chosen and for studying the molecular basis of adaptive evolution.


Figure 2: Comparison of the effects of variation in real neutral processes on NF, CF, and CNF models. All alignments were exactly 49,998 nt long (see Theory and Methods). (A) Alignment GC% on the x axis against the estimate of ω from the “trinucleotide” models CFtri,HKY,s, NFtri,GTR,s, and CNFtri,GTR,s on the y axis (the best performing [i.e., least correlated] of the parameterizations considered; see Theory and Methods, Supplementary fig. S2, Supplementary Material online). The estimate and significance of Kendall's τ measure of association between GC% and the estimate of ω are shown for each panel. The dashed horizontal line is ω = 1. (B) A quantile–quantile plot using quantiles from χ₁² (the expected null distribution) against quantiles of the LRs from testing the alternate (ω ≠ 1) against the null (ω = 1) hypotheses. The dashed diagonal line shows the expected case when the null hypothesis is correct.

Mentions: The conditions affecting the evolution of real biological sequences are more complicated than those used for the simulation, with several neutral evolutionary factors, including biased gene conversion (Berglund et al. 2009; Galtier et al. 2009), identified as potentially confounding the estimation of ω from real protein-coding sequences (Supplementary fig. S1, Supplementary Material online). We sought to assess the robustness of the models to these biases. We chose primate introns as they experience the same mutagenic environment as their flanking exons (Green et al. 2003; Duret et al. 2006; Elango et al. 2008), they are not affected by selection for protein-coding content, and they are mostly nonfunctional (Siepel et al. 2005). Using these sequences, we were able to demonstrate that our new model gives estimates consistent with ω = 1 (i.e., no effect of protein-coding selection), whereas previous models gave inaccurate estimates with ω ≠ 1. ω was estimated from intronic sequences using the trinucleotide rather than codon model variants as introns can include trinucleotides that are invalid for codon models (see Theory and Methods). For each model form, the best performing (least correlated) of the parameterizations considered (GTR and HKY; Hasegawa et al. 1985) is shown in figure 2A (see Supplementary fig. S2a, Supplementary Material online, for the remainder). As predicted, the estimate of ω was significantly correlated with composition even for the best performing CFtri,s model (fig. 2). Only the estimates of ω from NFtri,GTR,s and CNFtri,GTR,s did not have a significant association with GC% (fig. 2A). The consistency of the models with the null hypothesis (ω = 1) is indicated by the quantile–quantile plots. These confirm that CNFtri,GTR,s best matches theoretical expectation and that the other models are more prone to false positives (fig. 2B and Supplementary fig. S2b, Supplementary Material online).
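The association test described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `gc_percent` and `kendall_tau_a` are hypothetical, the tau-a variant (ties count as neither concordant nor discordant) is an assumption since the text does not say which variant was used, and the significance test reported in the figure is not reproduced here.

```python
from itertools import combinations


def gc_percent(seq: str) -> float:
    """Percentage of G and C among the unambiguous A/C/G/T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def kendall_tau_a(x, y) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: the tau-a variant is used; tied pairs add to neither count.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

In use, one would compute `gc_percent` for each intron alignment, pair it with that alignment's estimate of ω, and pass the two lists to `kendall_tau_a`; a value near 0 corresponds to the lack of association with GC% that an unbiased model should show on neutral sequence.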
The strong consistency between CNFtri,GTR,s and NFtri,GTR,s in these analyses stems from their being trinucleotide models. The codon NF model will exhibit greater bias due to its treatment of stop codons (Theory and Methods). Our analysis of intronic sequences combined with the relationship between the trinucleotide and codon model forms (Theory and Methods) establishes CNFGTR as the most robust form for estimating ω.
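The likelihood-ratio comparison behind the quantile–quantile plots can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names are hypothetical, the log-likelihoods would come from fitting the null (ω = 1) and alternate (ω ≠ 1) models elsewhere, and the plotting positions (i + 0.5)/n are an assumed convention. The sketch relies on the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math
from statistics import NormalDist


def likelihood_ratio(lnl_null: float, lnl_alt: float) -> float:
    """LR statistic: twice the log-likelihood gain of the alternate over the null."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_1_pvalue(lr: float) -> float:
    """Upper-tail probability of chi-square with 1 df: erfc(sqrt(lr / 2))."""
    if lr <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(lr / 2.0))


def chi2_1_quantile(p: float) -> float:
    """Quantile of chi-square with 1 df: squared standard normal quantile at (1 + p) / 2."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z


def qq_points(lrs):
    """Pair each sorted LR with its expected chi-square(1) quantile.

    Assumption: plotting positions (i + 0.5) / n for i = 0 .. n - 1.
    """
    ordered = sorted(lrs)
    n = len(ordered)
    return [(chi2_1_quantile((i + 0.5) / n), lr) for i, lr in enumerate(ordered)]
```

If the null hypothesis holds and the model is well behaved, the points returned by `qq_points` should fall close to the diagonal; points lying above it indicate LRs larger than expected, that is, an excess of false positives.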

