Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference.
Bottom Line: Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.
Affiliation: Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland, Department of Molecular Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK; MRC Clinical Sciences Centre, London W12 0NN, UK
Mentions: We decomposed the effect of the various filtering methods on the false positive and false negative rates, using the enriched species discordance test and taking the approximate Bayesian posterior (Anisimova et al. 2011) as the measure of support for each branch. Figure 3 shows the results of this analysis applied to amino acid sequences. As expected, filtering consistently led to an increase in the false negative rate for all conditions. This is consistent with our other observations, which indicate that many phylogenetically informative sites are lost to alignment filtering. Worryingly, the false positive rate also increased in many combinations, particularly when we used lower minimum support thresholds (0.75 and 0.9) in the Fungi and Eukaryote data sets. With more stringent thresholds (0.95 and 0.99), the impact of filtering on the false positive rate was less pronounced but in many cases still detrimental. Only in the Bacteria data set did filtering lead to a decrease in the false positive rate. These observations also broadly hold for the nucleotide alignments (Supplementary Fig. 11 available on Dryad at http://dx.doi.org/10.5061/dryad.pc5j0).
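To make the false positive and false negative rates concrete, here is a minimal sketch of how they could be computed at a given minimum support threshold. It is not the authors' implementation. It assumes that each branch is represented as a split (the set of taxa on one side of the branch), that a reference set of correct splits is available, that the false positive rate is the fraction of supported inferred branches absent from the reference, and that the false negative rate is the fraction of reference branches not recovered with sufficient support. The function name `error_rates` and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

# A branch is represented by the set of taxa on one side of it (an assumption
# made for this sketch).
Split = FrozenSet[str]


def error_rates(
    inferred: Dict[Split, float],
    reference: Set[Split],
    min_support: float,
) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) at a support threshold.

    inferred maps each branch of the inferred tree to its support value;
    reference holds the branches assumed to be correct.
    """
    # Only branches meeting the minimum support threshold count as predictions.
    supported = {split for split, support in inferred.items() if support >= min_support}
    false_positives = supported - reference   # supported but wrong
    false_negatives = reference - supported   # correct but not recovered
    fp_rate = len(false_positives) / len(supported) if supported else 0.0
    fn_rate = len(false_negatives) / len(reference) if reference else 0.0
    return fp_rate, fn_rate


# Toy data (hypothetical): two reference branches, two inferred branches.
reference = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
inferred = {
    frozenset({"A", "B"}): 0.97,       # correct, strongly supported
    frozenset({"A", "B", "D"}): 0.80,  # wrong, moderately supported
}

for threshold in (0.75, 0.95):
    print(threshold, error_rates(inferred, reference, threshold))
```

In this toy case, raising the threshold from 0.75 to 0.95 discards the wrong branch, so the false positive rate drops, while the false negative rate is unchanged because the missing reference branch is never recovered.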