CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models.

Gil M, Zanetti MS, Zoller S, Anisimova M - Mol. Biol. Evol. (2013)

Bottom Line: Here, we present a fast maximum likelihood (ML) package for phylogenetic inference, CodonPhyML, offering hundreds of different codon models, the largest variety to date, for phylogeny inference by ML. CodonPhyML is tested on simulated and real data and is shown to offer excellent speed and convergence properties. In addition, CodonPhyML includes the most recent fast methods for estimating phylogenetic branch supports and provides an integral framework for model selection, including amino acid and DNA models.


Affiliation: Department of Computer Science, Swiss Federal Institute of Technology, Zürich, Switzerland. manuel.gil.sci@gmail.com

ABSTRACT
Markov models of codon substitution naturally incorporate the structure of the genetic code and the selection intensity at the protein level, providing a more realistic representation of protein-coding sequences compared with nucleotide or amino acid models. Thus, for protein-coding genes, phylogenetic inference is expected to be more accurate under codon models. So far, phylogeny reconstruction under codon models has been elusive due to the computational difficulties of dealing with high-dimensional matrices. Here, we present a fast maximum likelihood (ML) package for phylogenetic inference, CodonPhyML, offering hundreds of different codon models, the largest variety to date, for phylogeny inference by ML. CodonPhyML is tested on simulated and real data and is shown to offer excellent speed and convergence properties. In addition, CodonPhyML includes the most recent fast methods for estimating phylogenetic branch supports and provides an integral framework for model selection, including amino acid and DNA models.

mst034-F3: Comparison of running time, tree topology, and model fit (AICc) between various empirical, semiparametric, and parametric models on three real data sets: seven-transmembrane receptor from the rhodopsin family of metazoa (A), the mammalian EPH receptor A4 (B), and the mammalian Integrin β1 binding protein 1 (C). The top part of each subfigure provides the CPU time until convergence for each of the evaluated models using CodonPhyML with default parameters, in particular, without multicore support. The lower part of each subfigure shows a color-coded similarity matrix representing the normalized AICc score differences (upper triangle) and the normalized RF distance (lower triangle) between any pair of models tested. The top and lower parts of each subfigure use the same model order, so model positions map exactly onto the matrix. For readability, gray lines define partitions between different types of models: P, parametric; E, empirical; and SP, semiparametric. The comparison of models is performed from the model on the vertical scale to the model on the horizontal scale. In the upper triangle, for AICc differences, color ranges from blue (<0 or “better fit”) to green (=0 or “same fit”) and to red (>0 or “worse fit”). For example, amino acid models fit data better than codon models for data in (A) but worse for data in (B) and (C). For RF distance, because the truth is unknown, the notions of “better” or “worse” cannot be defined. Thus, only the extent of differences in RF distance is shown in the lower triangle: color ranges from green (=0 or “same topology”) to red (>0 or “very different topology”). For example, the choice of a codon or an amino acid model leads to significant differences in the inferred ML topology.
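The two quantities shown in the matrix can be illustrated with a minimal Python sketch. It uses the standard definitions of AICc and of the Robinson-Foulds (RF) distance between unrooted trees. The function names, the encoding of trees as sets of bipartitions, and the example numbers are all assumptions made for illustration; they are not taken from CodonPhyML, and the normalization the authors applied to AICc differences is not specified here, so it is left out.

```python
def aicc(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Corrected Akaike information criterion (lower means better fit)."""
    if n_samples - n_params - 1 <= 0:
        raise ValueError("AICc requires n_samples > n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_samples - n_params - 1)


def pairwise_aicc_differences(scores: dict) -> dict:
    """Difference AICc(row) - AICc(column) for every ordered pair of models.

    A negative value means the row model fits better than the column model,
    which matches the "vertical versus horizontal" reading of the caption.
    """
    return {(a, b): scores[a] - scores[b] for a in scores for b in scores}


def normalized_rf(splits_a: set, splits_b: set, n_taxa: int) -> float:
    """Robinson-Foulds distance between two unrooted trees, scaled to [0, 1].

    Each tree is given as the set of its non-trivial bipartitions. A
    bipartition is encoded as a frozenset holding the taxa on the side that
    does NOT contain a fixed reference taxon (an encoding assumed here).
    """
    max_rf = 2 * (n_taxa - 3)  # two fully resolved trees sharing no split
    if max_rf <= 0:
        return 0.0
    return len(splits_a ^ splits_b) / max_rf


# Hypothetical log-likelihoods and parameter counts, invented for illustration.
scores = {
    "codon_model": aicc(log_likelihood=-1000.0, n_params=10, n_samples=500),
    "aa_model": aicc(log_likelihood=-1010.0, n_params=5, n_samples=500),
}
differences = pairwise_aicc_differences(scores)

# Two five-taxon trees, reference taxon "A":
# ((A,B),C,(D,E)) versus ((A,C),B,(D,E)).
tree_1 = {frozenset("CDE"), frozenset("DE")}
tree_2 = {frozenset("BDE"), frozenset("DE")}
rf = normalized_rf(tree_1, tree_2, n_taxa=5)  # one split shared, two differ: 0.5
```

In this sketch a negative entry in `differences` corresponds to the blue end of the color scale and a positive entry to the red end, while `rf` equal to 0 corresponds to identical topologies.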

Mentions: For real data (described in table 2), we observed that the algorithm typically converged faster under empirical codon models compared with either parametric or semiparametric models (e.g., see fig. 2). This trend was especially prominent for large and diverged data sets (e.g., R10–R11), where, in addition, semiparametric models had much faster convergence compared with their parametric equivalents (e.g., see fig. 3). One reason empirical and semiparametric models tend to converge faster than equivalent parametric models may be the improved model fit facilitated by their empirical elements, which are inferred from large amounts of data. As these estimates capture some global patterns present in real data, the consequent improved model fit may also be reflected in the properties of the corresponding log-likelihood landscape.

