Bayesian species delimitation can be robust to guide-tree inference errors.
View Article:
PubMed Central - PubMed
Affiliation: Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Box 50007, SE-104 05 Stockholm, Sweden; Genome Center and Section of Evolution and Ecology, University of California at Davis, One Shields Avenue, Davis, CA 95616, USA; Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK; and Center for Computational Genomics, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.
The Bayesian method uses Bayesian model selection to compare different species-delimitation models in the multispecies coalescent framework, and uses reversible-jump Markov chain Monte Carlo (rjMCMC) to estimate the posterior probabilities for different delimitation models... The method accommodates multiple loci, and does not require reciprocal monophyly of inferred gene trees... The three or five sequences from each of the eight populations were constrained to be monophyletic... The substitution model used was GTR, since RAxML does not implement the JC69 model... Also, RAxML does not implement the molecular clock and infers unrooted trees instead... Given the guide tree, the nuclear sequence data (either one locus or five loci) simulated above were analyzed using bpp version 2.2 to delimit species... Note that we used only the population tree topology inferred by the two methods (RAxML/beast and *beast), and ignored any support measures for clades on the tree, such as the bootstrap support values calculated by RAxML and the posterior clade probabilities calculated by *beast... The results show clear effects of the species phylogeny (in particular, the lengths of the internal branches reflecting species divergence times), the mutation rate, and the number of loci... For example, clade ABC in tree 1 is recovered in only 55% of replicate data sets at the low-mutation rate (Fig. 2a)... A single locus at the low-mutation rate does not contain enough information to infer the correct guide tree... The better performance of bpp for the large sample size appears to be largely due to the increased information content for species delimitation since the improvement in guide-tree inference is moderate... A previous simulation found that increasing the number of sequences sampled from the same species improves species delimitation by bpp, leading to both reduction of false positives (over-splitting errors) and increase of power (correctly delimiting distinct species)... 
In our simulation, we assumed no gene flow (migration, hybridization, or introgression) after species divergence, and conflicts between gene trees from different genomic regions or between mitochondrial and nuclear loci are entirely due to ancestral polymorphism and incomplete lineage sorting... Although the results suggest that a few loci of sequence data are insufficient for structurama to assign individuals to populations reliably, the impact of assignment errors on species delimitation by bpp under more realistic scenarios remains unknown.
We used two species trees, each of four species, to simulate the sequence data under the multispecies coalescent model (Rannala and Yang 2003). Tree 1 is balanced while tree 2 is unbalanced (Fig. 1). Parameters in the model include three species divergence times (τs) as well as seven population size parameters (θs) for the extant and extinct species on the species tree. Both τ and θ are measured by the expected number of mutations per site. For example, τ_AB = 0.01 in species tree 1 of Figure 1a means that the average sequence divergence from the time of the AB ancestor to the present is 1%, whereas θ = 0.02 means that two random sequences sampled from the same population are 2% different on average. We assumed that each of the four species (A–D) consisted of two populations (labeled , etc.), so that there are eight populations on the guide tree. We considered two sample sizes, with three or five sequences sampled at each locus from each of the eight populations (i.e., with 24 or 40 sequences in the alignment for each locus). The program mccoal in bpp version 2.1c was used to generate gene trees with coalescent times (branch lengths) under the multispecies coalescent model (Rannala and Yang 2003) and to simulate sequence alignments given the gene trees. For each species tree, two sets of parameters were used, mimicking two different mutation rates (Table 1). We simulated either one or five nuclear loci as well as one mitochondrial locus. We assumed that the mutation rate at the mitochondrial locus was 20 times as high as at a nuclear locus and that the population size was one-fourth that for a nuclear locus, so that τ_mt = 20τ and θ_mt = 5θ (e.g., Zhou et al. 2012). The JC69 mutation model (Jukes and Cantor 1969) was assumed both to generate and to analyze the sequence alignments.
Note that the role of the mutation model here is to correct for multiple hits when estimating the gene tree topology and branch lengths, and that JC69 is deemed adequate for analysis of such highly similar sequences (Burgess and Yang 2008); in previous studies, even the infinite sites model produced very similar results (Satta et al. 2004).
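The parameter scaling and the multiple-hits correction described above can be illustrated with a short Python sketch. This is not part of bpp or mccoal; the function names are hypothetical, and the input values 0.02 and 0.01 are simply the illustrative θ and τ from the text. The sketch assumes that θ is proportional to population size times mutation rate and that τ is proportional to mutation rate alone, which is what gives the factors of 5 and 20 for the mitochondrial locus.

```python
import math


def mito_params(theta_nuc, tau_nuc, rate_ratio=20.0, size_ratio=0.25):
    """Scale nuclear theta and tau to a mitochondrial locus (hypothetical helper).

    theta scales with both population size and mutation rate;
    tau scales with mutation rate only.
    """
    theta_mt = theta_nuc * rate_ratio * size_ratio  # 20 * 1/4 = 5-fold
    tau_mt = tau_nuc * rate_ratio                   # 20-fold
    return theta_mt, tau_mt


def jc69_distance(p):
    """JC69 distance (expected substitutions per site) from the
    observed proportion p of differing sites between two sequences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def alignment_size(n_populations=8, seqs_per_population=3):
    """Number of sequences per locus for a given sampling scheme."""
    return n_populations * seqs_per_population


if __name__ == "__main__":
    theta_mt, tau_mt = mito_params(theta_nuc=0.02, tau_nuc=0.01)
    print(f"theta_mt = {theta_mt:.3f}, tau_mt = {tau_mt:.3f}")
    # For closely related sequences the correction is small.
    print(f"JC69 distance at p = 0.02: {jc69_distance(0.02):.5f}")
    print(alignment_size(8, 3), alignment_size(8, 5))
```

At an observed difference of 2%, the JC69 correction changes the distance only in the fourth decimal place, which is consistent with the remark that a simple model is adequate for such highly similar sequences.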