A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences.

Groussin M, Boussau B, Gouy M - Syst. Biol. (2013)

Bottom Line: This model, henceforth called Correspondence and likelihood analysis (COaLA), makes use of a correspondence analysis to reduce the number of parameters to be optimized through maximum likelihood, focusing on most of the compositional variation observed in the data. COaLA efficiently estimates ancestral amino acid frequencies and sequences, making it relevant for studies aiming at reconstructing and resurrecting ancestral amino acid sequences. Finally, we applied COaLA on a concatenate of universal amino acid sequences to confirm previous results obtained with a nonhomogeneous Bayesian model regarding the early pattern of adaptation to optimal growth temperature, supporting the mesophilic nature of the Last Universal Common Ancestor.


Affiliation: Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR5558, Villeurbanne, France. mathieu.groussin@univ-lyon1.fr

ABSTRACT
Most models of nucleotide or amino acid substitution used in phylogenetic studies assume that the evolutionary process has been homogeneous across lineages and that composition of nucleotides or amino acids has remained the same throughout the tree. These oversimplified assumptions are refuted by the observation that compositional variability characterizes extant biological sequences. Branch-heterogeneous models of protein evolution that account for compositional variability have been developed, but are not yet in common use because of the large number of parameters required, leading to high computational costs and potential overparameterization. Here, we present a new branch-nonhomogeneous and nonstationary model of protein evolution that captures more accurately the high complexity of sequence evolution. This model, henceforth called Correspondence and likelihood analysis (COaLA), makes use of a correspondence analysis to reduce the number of parameters to be optimized through maximum likelihood, focusing on most of the compositional variation observed in the data. The model was thoroughly tested on both simulated and biological data sets to show its high performance in terms of data fitting and CPU time. COaLA efficiently estimates ancestral amino acid frequencies and sequences, making it relevant for studies aiming at reconstructing and resurrecting ancestral amino acid sequences. Finally, we applied COaLA on a concatenate of universal amino acid sequences to confirm previous results obtained with a nonhomogeneous Bayesian model regarding the early pattern of adaptation to optimal growth temperature, supporting the mesophilic nature of the Last Universal Common Ancestor.


Figure 2: Accuracy of estimation of ancestral root amino acid frequencies. The y-axes show the differences between the amino acid frequencies inferred by ML and the true amino acid frequencies used to simulate the sequences. a) Results obtained with the H–S LG+Fopt model. b) Results obtained with the H–NS LG+COaLA[2] model. c) Results obtained with the NH–NS LG+COaLA[2] model.

Mentions: Figure 2 shows that for the first 1000 alignments, both the NH–NS and H–NS LG+COaLA[2] models outperform the H–S LG+F model at estimating ancestral root frequencies. The sums of the squared differences between true and inferred amino acid frequencies are equal to 4.38, 1.15, and 0.98 for the H–S, H–NS, and NH–NS models, respectively, with the NH–NS model performing slightly better than the H–NS approach (Wilcoxon paired test, P < 0.001). Furthermore, we observed that for both rare amino acids (such as cysteine or tryptophan) and frequent amino acids (such as alanine), the NH–NS COaLA model remains the best (P < 0.001) at estimating ancestral frequencies at the root (the sums of the squared differences are, in the same order as before, 0.018, 0.0025, and 0.0022 for tryptophan and 0.398, 0.112, and 0.098 for alanine). This might be explained by the fact that COA is equally sensitive to deviations in rare amino acids as it is to deviations in frequent amino acids.
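The accuracy measure quoted above, a sum of squared differences between true and inferred root frequencies, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy three-letter alphabet, and the made-up frequencies are all hypothetical, and summing over simulated replicates is an assumption about how the reported totals were aggregated.

```python
from typing import Dict, List

Freqs = Dict[str, float]  # amino acid letter -> frequency


def squared_error_by_amino_acid(true_freqs: List[Freqs],
                                inferred_freqs: List[Freqs]) -> Dict[str, float]:
    """Sum, over replicates, of (inferred - true)^2 for each amino acid."""
    if len(true_freqs) != len(inferred_freqs):
        raise ValueError("need one inferred profile per true profile")
    totals: Dict[str, float] = {}
    for truth, estimate in zip(true_freqs, inferred_freqs):
        for aa, p_true in truth.items():
            diff = estimate[aa] - p_true
            totals[aa] = totals.get(aa, 0.0) + diff * diff
    return totals


def total_squared_error(per_aa: Dict[str, float]) -> float:
    """Overall score: per-amino-acid sums added across the alphabet."""
    return sum(per_aa.values())


# Toy data (assumption): two replicates over a three-letter alphabet.
true_freqs = [{"A": 0.5, "C": 0.25, "W": 0.25},
              {"A": 0.5, "C": 0.25, "W": 0.25}]
inferred = [{"A": 0.75, "C": 0.25, "W": 0.0},
            {"A": 0.5, "C": 0.5, "W": 0.0}]

per_aa = squared_error_by_amino_acid(true_freqs, inferred)
total = total_squared_error(per_aa)
print(per_aa, total)
```

A lower total indicates root frequencies closer to the simulated truth; the per-amino-acid breakdown corresponds to the separate figures the text reports for tryptophan and alanine.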

