Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference.
Bottom Line: Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.
Affiliation: Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland, Department of Molecular Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK; MRC Clinical Sciences Centre, London W12 0NN, UK;
Mentions: Because the species tree discordance test requires trusted, fully resolved topologies, it tends to be restricted to small sets of sequences. We chose to be very cautious and to use sequences from small six-species sets for which the species tree is incontrovertible (see section “Methods”). The downside of this approach is that findings based on such six-sequence alignments might not generalize well to larger alignments. To test larger alignments, one could use larger trusted topologies, but larger topologies can be more difficult to defend. Instead, we have developed a new test variant, which we call the “enriched species tree discordance test” (Fig. 1). In this variant, the original six orthologs are augmented with a number of homologous sequences. The combined set is then aligned and the various filtering methods are applied to the resulting MSAs. Subsequently, trees are estimated from the filtered and unfiltered MSAs. To evaluate the resulting trees, all leaves corresponding to additional homologs are pruned, leaving the subtree consisting of the original six taxa only, whose topology can be compared to the reference species topology. If available, larger trusted topologies would be preferable, because they would allow us to measure all topological differences in the augmented trees and thereby confer a higher statistical power to the test. But as long as the systematic improvement or worsening also affects the placement of the original sequences, this protocol makes it possible to compare the quality of larger alignments and trees without requiring the trusted topology to be larger (i.e., better resolved).
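The evaluation step of the enriched test (prune the additional homologs, then compare the remaining six-taxon subtree to the reference topology) can be illustrated with a minimal sketch. It is not the authors' implementation: the nested-tuple tree encoding, the taxon labels (A to F for the original orthologs, X1 to X3 for added homologs), the helper names, and the use of an unrooted Robinson-Foulds count as the comparison measure are all assumptions made for illustration.

```python
# Illustrative sketch only; trees are nested tuples, leaves are strings.
# Labels A-F (original orthologs) and X1-X3 (added homologs) are hypothetical.

def prune(tree, keep):
    """Restrict a tree to the leaves in `keep`, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # a node left with one child is removed
    return tuple(kids)

def leaves(tree):
    """Return the set of leaf labels of a tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain an
    arbitrary anchor taxon, so that equal splits compare equal.
    """
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_taxa - below if anchor in below else below
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return below

    walk(tree)
    return found

def rf_distance(tree_a, tree_b):
    """Count bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Assumed reference species topology for the six original taxa.
reference = ((("A", "B"), "C"), (("D", "E"), "F"))
original_taxa = leaves(reference)

# Two hypothetical gene trees inferred from enriched alignments.
enriched_concordant = (((("A", "X1"), "B"), ("C", "X2")), ((("D", "E"), "X3"), "F"))
enriched_discordant = (((("A", "X1"), "C"), ("B", "X2")), ((("D", "E"), "X3"), "F"))

pruned_concordant = prune(enriched_concordant, original_taxa)
pruned_discordant = prune(enriched_discordant, original_taxa)

distance_concordant = rf_distance(pruned_concordant, reference)
distance_discordant = rf_distance(pruned_discordant, reference)
```

In this toy case the first pruned tree matches the reference exactly, while the second misplaces B and C and therefore differs from the reference by one bipartition in each tree. Averaging such a distance over many gene families, separately for filtered and unfiltered alignments, is one way the comparison described above could be summarized.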