Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration.

Gavryushkina A, Welch D, Stadler T, Drummond AJ - PLoS Comput. Biol. (2014)

Bottom Line: We show that even if sampled ancestors are not of specific interest in an analysis, failing to account for them leads to significant bias in parameter estimates. We also apply the method to infer divergence times and diversification rates when fossils are included along with extant species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages as argued in the literature.


Affiliation: Department of Computer Science, University of Auckland, Auckland, New Zealand; Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand.

ABSTRACT
Phylogenetic analyses that include fossils or molecular sequences sampled through time require models that allow one sample to be a direct ancestor of another sample. Because previously available phylogenetic inference tools assume that all samples are tips, they do not allow for this possibility. We have developed and implemented a Bayesian Markov Chain Monte Carlo (MCMC) algorithm to infer what we call sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals. We use a family of birth-death models in which individuals may remain in the tree process after sampling; in particular, we extend the birth-death skyline model [Stadler et al., 2013] to sampled ancestor trees. This method allows the detection of sampled ancestors as well as estimation of the probability that an individual will be removed from the process when it is sampled. We show that even if sampled ancestors are not of specific interest in an analysis, failing to account for them leads to significant bias in parameter estimates. We also show that sampled ancestor birth-death models in which every sample comes from a different time point are non-identifiable and thus require one parameter to be known in order to infer the other parameters. We apply our phylogenetic inference accounting for sampled ancestors to epidemiological data, where the possibility of sampled ancestors enables us to identify individuals that infected other individuals after being sampled and to infer fundamental epidemiological parameters. We also apply the method to infer divergence times and diversification rates when fossils are included along with extant species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages as argued in the literature. The sampler is available as an open-source BEAST2 package (https://github.com/CompEvol/sampled-ancestors).
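To make the role of the removal probability concrete, the following is a minimal forward-simulation sketch, not the authors' BEAST2 sampler. It assumes constant birth, death and sampling rates, sampling through time only (no sampling at the present), and a Gillespie-style event loop. The function name, its parameters and the `max_lineages` guard are all hypothetical. A sample is counted as a sampled ancestor when a later sample occurs on the same lineage or on one of its descendants.

```python
import random


def simulate_sampled_ancestors(birth, death, sampling, removal_prob,
                               t_max, seed=None, max_lineages=10_000):
    """Simulate a birth-death process with sampling through time.

    Returns (n_samples, n_sampled_ancestors). All names are illustrative.
    """
    rng = random.Random(seed)
    # Each live lineage carries the ids of earlier samples on its path
    # back to the origin (samples that were not removed when sampled).
    lineages = [frozenset()]
    n_samples = 0
    has_sampled_descendant = set()
    per_lineage_rate = birth + death + sampling
    t = 0.0
    while lineages and len(lineages) < max_lineages:
        # Waiting time to the next event over all live lineages.
        t += rng.expovariate(per_lineage_rate * len(lineages))
        if t > t_max:
            break
        i = rng.randrange(len(lineages))
        u = rng.random() * per_lineage_rate
        if u < birth:
            # Birth: the new lineage inherits the same sampled ancestors.
            lineages.append(lineages[i])
        elif u < birth + death:
            # Death: drop lineage i (swap with the last one, then pop).
            lineages[i] = lineages[-1]
            lineages.pop()
        else:
            # Sampling: every earlier sample on this path now has a
            # sampled descendant, so it is a sampled ancestor.
            sample_id = n_samples
            n_samples += 1
            has_sampled_descendant.update(lineages[i])
            if rng.random() < removal_prob:
                # Removed on sampling: the lineage leaves the process.
                lineages[i] = lineages[-1]
                lineages.pop()
            else:
                # Stays in the process and may become an ancestor later.
                lineages[i] = lineages[i] | {sample_id}
    return n_samples, len(has_sampled_descendant)
```

With `removal_prob=1.0` every sampled individual leaves the process immediately, so the sketch can never produce a sampled ancestor, which corresponds to the all-samples-are-tips assumption described above. With `removal_prob=0.0` sampled individuals always remain and can go on to leave sampled descendants.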

Figure 3 (pcbi-1003919-g003): Properties of the tree estimated from simulated data (fossilized birth-death process). The graph shows median estimates (black dots) and 95% HPD intervals (grey lines) against true values for the tree height (on the left) and number of sampled ancestors (on the right). The upper row shows the estimates obtained from the analyses of simulated sequence data of all sampled nodes, and the bottom row shows the estimates from the analyses where only sequence data from the extant samples were used.

Mentions: For the fossilized birth-death process (the process with ρ-sampling and zero removal probability), we simulated a set of trees under a fixed set of tree model parameters. When we analysed sequence data from all sampled nodes, each parameter was estimated and, in the worst case, the median of the relative errors over all runs was 0.22 (0.24 for the analyses without ψ-sampled sequences). The median of the relative errors for tree properties, such as the time of origin, tree height and number of sampled ancestors, was at most 0.09 (0.14 without ψ-sampled sequences). The true parameters and tree properties were within the estimated 95% HPD intervals at least 95% (93% without ψ-sampled sequences) of the time in all cases. The estimates of the number of sampled ancestors and the tree height for both cases are shown in Figure 3. Figure 4 shows how the uncertainty in estimates of the turnover rate decreases with the size of the tree (i.e., with the number of sampled nodes) and increases when the sequences of ψ-sampled nodes are discarded. Overall, removing the sequence data of ψ-sampled nodes led to larger errors and wider 95% intervals. The medians of the errors for the turnover rate and sampling proportion were comparable, as was the coverage for all macroevolutionary parameters. This might be due to fixing ρ to the truth. The detailed results of this set of simulations can be found in Supporting Information (Table 4 in Text S1).
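The summary statistics quoted above (median relative error and 95% HPD coverage) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code. It assumes the point estimate is the posterior median, the relative error is |estimate − truth| / |truth|, and the HPD interval is the shortest interval containing 95% of the sorted posterior samples. The function names and the input layout (a list of `(true_value, posterior_samples)` pairs, one per simulation run) are hypothetical.

```python
import math
from statistics import median


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of points inside the interval
    # Slide a window of k consecutive sorted points; keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def summarize_runs(runs, mass=0.95):
    """Return (median relative error, HPD coverage) across runs.

    `runs` is a list of (true_value, posterior_samples) pairs, where
    true_value is nonzero.
    """
    rel_errors = []
    covered = 0
    for truth, samples in runs:
        estimate = median(samples)  # posterior median as point estimate
        rel_errors.append(abs(estimate - truth) / abs(truth))
        lower, upper = hpd_interval(samples, mass)
        covered += lower <= truth <= upper
    return median(rel_errors), covered / len(runs)
```

For example, `summarize_runs([(10.0, [8, 9, 10, 11, 12]), (4.0, [5, 5, 5, 5, 5])])` gives a median relative error of 0.125 and a coverage of 0.5: the first run recovers the truth exactly and covers it, while the second is off by 25% and its degenerate interval misses the truth.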

