Probabilistic phylogenetic inference with insertions and deletions.

Rivas E, Eddy SR - PLoS Comput. Biol. (2008)

Bottom Line: A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip.


Affiliation: Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, Virginia, United States of America. rivase@janelia.hhmi.org

ABSTRACT
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
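To make the idea of a rate matrix augmented by the gap character concrete, here is a minimal sketch of Felsenstein peeling over a five-state alphabet (A, C, G, T and the gap). It is an illustration under stated assumptions, not the authors' model: the rate values, the uniform root distribution, the nested-list tree format, and every function name are hypothetical, and the non-reversible birth-death treatment of insertions described in the abstract is not reproduced.

```python
STATES = "ACGT-"  # four nucleotides plus the gap character as a fifth state


def make_rate_matrix(sub_rate=1.0, del_rate=0.05, ins_rate=0.05):
    """Hypothetical 5-state rate matrix (assumed rates, not the paper's)."""
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = sub_rate / 3.0   # uniform substitutions
        q[i][4] = del_rate                  # residue -> gap (deletion)
    for j in range(4):
        q[4][j] = ins_rate / 4.0            # gap -> residue (insertion)
    for i in range(n):
        q[i][i] = -sum(q[i])                # rows of a rate matrix sum to zero
    return q


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t) via a truncated Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def conditional_likelihoods(node, column, q):
    """Return L[x] = P(observed leaves below node | state x at node)."""
    if isinstance(node, str):  # leaf: indicator vector on the observed character
        return [1.0 if s == column[node] else 0.0 for s in STATES]
    like = [1.0] * len(STATES)
    for child, t in node:      # internal node: list of (child, branch_length)
        p = transition_matrix(q, t)
        child_like = conditional_likelihoods(child, column, q)
        for x in range(len(STATES)):
            like[x] *= sum(p[x][y] * child_like[y]
                           for y in range(len(STATES)))
    return like


def column_likelihood(tree, column, q, root_freqs):
    """Likelihood of one alignment column, summing over the root state."""
    like = conditional_likelihoods(tree, column, q)
    return sum(f * l for f, l in zip(root_freqs, like))


# Three-taxon example: ((a:0.1, b:0.1):0.05, c:0.2) with a gap in sequence b.
tree = [([("a", 0.1), ("b", 0.1)], 0.05), ("c", 0.2)]
q = make_rate_matrix()
print(column_likelihood(tree, {"a": "A", "b": "-", "c": "A"}, q, [0.2] * 5))
```

Because the gap is just another state here, the recursion costs the same as for a substitution-only model apart from the larger alphabet, which is the efficiency property the abstract refers to.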


pcbi-1000172-g003: Comparison of dnaml versus dnamlε for synthetic alignments with gaps. Tree reconstruction test for simulated alignments with gaps, generated with the program rose [76] according to 100 random eight-taxon trees, using the F84 model for substitutions and a Poisson gap length distribution (λ = 0.5). Alignment lengths range from 50 to 1,000 residues. Here we compare the performance of dnaml (left) versus dnamlε (right) for two different evolutionary situations: in red, alignments with no gaps and 0.06 substitutions per site and branch, which produce alignments with 21%±6% pairwise average substitutions; in blue, alignments with gaps (p_ins = p_del = 0.001 and 0.07 substitutions per site and branch) with a percentage of pairwise substitutions (20%±5%) similar to that of the ungapped alignments. The alignments with gaps have a percentage of pairwise gaps of 13%±4%. Results are presented as a function of the geometric mean of the sequence lengths. The tree reconstruction test assesses the similarity between the inferred tree and the original tree. Three measures of tree similarity are displayed in (A), (B), and (C), respectively: a binary count of whether the trees are topologically identical or not (TP), the Symmetric Difference Distance (SDD), and the normalized Branch Scoring Distance (nBSD). (D) Mean branch length of the inferred trees, and (E) the running time required for the different inferences.
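The two topology measures named in the caption can be sketched as follows. This is a minimal illustration, assuming trees are given as nested tuples of leaf names (a hypothetical format) and that SDD is the raw count of non-trivial bipartitions found in one tree but not the other; the paper may normalize this count, and nBSD is omitted because its normalization is not given here. The function names are invented for the example.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of a tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf to orient every split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Record the side of the split that does not contain the anchor.
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def symmetric_difference_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def topologically_identical(tree_a, tree_b):
    """Binary TP measure: True when the two trees share every bipartition."""
    return symmetric_difference_distance(tree_a, tree_b) == 0


true_tree = ((("a", "b"), "c"), ("d", "e"))
inferred_tree = ((("a", "c"), "b"), ("d", "e"))
print(symmetric_difference_distance(true_tree, inferred_tree))
print(topologically_identical(true_tree, ("a", "b", ("c", ("d", "e")))))
```

Orienting each split away from a fixed anchor leaf makes the comparison independent of where the tuple representation happens to place the root.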

Mentions: The results are shown in Figure 2. As expected, the accuracy of tree topologies inferred by the two methods is essentially identical by the TP and SDD measures. As in any phylogenetic inference method, accuracy improves with alignment length (more data is better), and shows an optimum average branch length of about 0.05, corresponding to average pairwise sequence identities of about 82% (more similar sequences have fewer substitutions and less signal, and less similar sequences are more saturated). In the best cases (0.05 branch length, alignments longer than 800 nt) both methods infer the correct tree about 78% of the time (Figure 3A). nBSD values are also essentially identical for most choices of average branch length. We do see significant differences in branch length estimation (either by the nBSD test, or by plotting average inferred branch length) at large, saturating choices of average branch lengths of 1.0 or 2.0, corresponding to almost uncorrelated random sequences. In this extreme (and not biologically relevant) regime, branch lengths make little difference in likelihood so long as they are large (and accordingly, likelihoods assigned by the two methods are not significantly different, despite the different inferred branch lengths), and the inferred branch lengths become sensitive to details of the optimization method. Thus these extreme cases appear to be identifying a biologically irrelevant weakness in dnaml's optimizer, which infers overly large branch lengths (5.7±0.5 instead of 2.0 for the largest alignments), whereas dnamlε results, produced by a different optimizer implementation, are closer to the correct value (1.55±0.05). Overall, with the exception of this minor difference in numerical optimization, these results indicate that dnamlε and dnaml do produce essentially identical results on ungapped alignments, as expected.
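The remark that large branch lengths make little difference in likelihood can be illustrated with a small calculation. The sketch below is only qualitative and rests on assumptions that differ from the paper's setup: it uses the Jukes-Cantor model for a single pair of sequences rather than F84 on an eight-taxon tree, and the function names are hypothetical. It compares the expected per-site log-likelihood lost when a distance is overestimated by the same factor (5.7/2.0) at a low and at a saturating divergence.

```python
import math


def jc_diff_prob(d):
    """Probability that two sequences differ at a site after d
    substitutions per site under Jukes-Cantor (an assumed stand-in for F84)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def expected_site_loglik(d_true, d_eval):
    """Expected per-site log-likelihood at distance d_eval when the data
    were generated at distance d_true."""
    p_true = jc_diff_prob(d_true)
    p_eval = jc_diff_prob(d_eval)
    same = (1.0 - p_eval) / 4.0   # probability of one specific identical pair
    diff = p_eval / 12.0          # probability of one specific mismatched pair
    return (1.0 - p_true) * math.log(same) + p_true * math.log(diff)


def loglik_loss(d_true, factor):
    """Expected per-site log-likelihood lost by overestimating d by `factor`."""
    return (expected_site_loglik(d_true, d_true)
            - expected_site_loglik(d_true, d_true * factor))


factor = 5.7 / 2.0  # ratio of the inferred to the true branch length quoted above
for d_true in (0.05, 2.0):
    print(d_true, loglik_loss(d_true, factor))
```

Under these assumptions the loss at the saturating distance is several times smaller than at the low distance, so an optimizer has much less signal to pin down the branch length, consistent with the sensitivity to optimization details described above.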

