Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference.

Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C - Syst. Biol. (2015)

Bottom Line: We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.

Affiliation: Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland, Department of Molecular Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK; MRC Clinical Sciences Centre, London W12 0NN, UK;

Figure 1: Schematic of the species tree discordance test used to evaluate filtering methods. The grey (green in online version) elements indicate extra steps involved in the enriched version of the test. The tests sample sets of orthologs with an undisputed phylogeny (black sequences). The enriched test adds homologous sequences with unknown branching order (green in online version). The input sequences are aligned and then filtered by the different filtering methods. The filtered alignments are evaluated by reconstructing trees from them, which are compared with the reference topology (red in online version). In the enriched test, all additional sequences are removed from the tree and what remains (the subtree relating the orthologous sequences) is compared to the reference topology. The unfiltered alignment is evaluated in the same way. All else being equal, the relative performance of the filtering methods can be assessed by their average congruence with the reference topology over a large number of input problems.

Mentions: Because the species tree discordance test requires trusted, fully resolved topologies, it tends to be restricted to small sets of sequences. We chose to be very cautious and to use sequences from small six-species sets for which the species tree is incontrovertible (see section “Methods”). The downside of this approach is that findings based on such six-sequence alignments might not generalize well to larger alignments. To test larger alignments, one could use larger trusted topologies, but larger topologies can be more difficult to defend. Instead, we have developed a new test variant, which we call the “enriched species tree discordance test” (Fig. 1). In this variant, the original six orthologs are augmented with a number of homologous sequences. The combined set is then aligned and the various filtering methods are applied to the resulting MSAs. Subsequently, trees are estimated from the filtered and unfiltered MSAs. To evaluate the resulting trees, all leaves corresponding to additional homologs are pruned, leaving the subtree consisting of the original six taxa only, whose topology can be compared to the reference species topology. If available, larger trusted topologies would be preferable, because they would allow us to measure all topological differences in the augmented trees and thereby confer a higher statistical power to the test. But as long as the systematic improvement or worsening also affects the placement of the original sequences, this protocol makes it possible to compare the quality of larger alignments and trees without requiring the trusted topology to be larger (i.e., better resolved).
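The evaluation step of the enriched test (prune the additional homologs, then compare the remaining six-taxon subtree with the reference topology) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nested-tuple tree encoding, the taxon names, the example trees, and the function names are hypothetical, and the use of unrooted bipartitions (a Robinson-Foulds style comparison) is one common way to compare topologies, not necessarily the exact measure used in the study.

```python
from typing import FrozenSet, Optional, Sequence, Set, Tuple, Union

# Hypothetical toy encoding: a tree is a leaf name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]
Split = FrozenSet[str]


def prune(tree: Tree, keep: Set[str]) -> Optional[Tree]:
    """Restrict a tree to the leaves in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # a node left with one child is suppressed
    return tuple(kept)


def leaves(tree: Tree) -> Split:
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[Split]:
    """Non-trivial bipartitions of the tree, treated as unrooted.

    Each bipartition is stored as the side that does not contain the
    alphabetically smallest taxon, so a clade and its complement coincide.
    """
    taxa = leaves(tree)
    anchor = min(taxa)
    found: Set[Split] = set()

    def visit(node: Tree) -> Split:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = taxa - clade
        if len(clade) > 1 and len(other) > 1:
            found.add(other if anchor in clade else clade)
        return clade

    visit(tree)
    return found


def rf_distance(a: Tree, b: Tree) -> int:
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(a) ^ splits(b))


def fraction_congruent(inferred: Sequence[Tree], reference: Tree,
                       orthologs: Set[str]) -> float:
    """Share of inferred trees whose pruned subtree matches the reference.

    Assumes every inferred tree contains all of the orthologs.
    """
    ref = splits(reference)
    hits = sum(splits(prune(t, orthologs)) == ref for t in inferred)
    return hits / len(inferred)


# Made-up data: six orthologs A-F plus two additional homologs X and Y.
ORTHOLOGS = {"A", "B", "C", "D", "E", "F"}
REFERENCE = ((("A", "B"), "C"), (("D", "E"), "F"))
ENRICHED_GOOD = (((("A", "X"), "B"), "C"), ((("D", "E"), "Y"), "F"))
ENRICHED_BAD = (((("A", "X"), "C"), "B"), ((("D", "E"), "Y"), "F"))

print(rf_distance(prune(ENRICHED_GOOD, ORTHOLOGS), REFERENCE))
print(rf_distance(prune(ENRICHED_BAD, ORTHOLOGS), REFERENCE))
print(fraction_congruent([ENRICHED_GOOD, ENRICHED_BAD], REFERENCE, ORTHOLOGS))
```

In this sketch, running `fraction_congruent` once on trees built from filtered alignments and once on trees built from the unfiltered alignments, over many input problems, would give the kind of average-congruence comparison the test relies on. The additional homologs influence the alignment and tree inference but never enter the comparison itself, which is why the trusted topology does not need to grow.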

