Reconstructible phylogenetic networks: do not distinguish the indistinguishable.

Pardi F, Scornavacca C - PLoS Comput. Biol. (2015)

Bottom Line: This identifiability problem is partially solved by accounting for branch lengths, although this merely reduces the frequency of the problem. For any given set of indistinguishable networks, we define a canonical network that, under mild assumptions, is unique and thus representative of the entire set. While on the methodological side this will imply a drastic reduction of the solution space in network inference, for the study of reticulate evolution this is a fundamental limitation that will require an important change of perspective when interpreting phylogenetic networks.


Affiliation: Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM, UMR 5506) CNRS, Université de Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France.

ABSTRACT
Phylogenetic networks represent the evolution of organisms that have undergone reticulate events, such as recombination, hybrid speciation or lateral gene transfer. An important way to interpret a phylogenetic network is in terms of the trees it displays, which represent all the possible histories of the characters carried by the organisms in the network. Interestingly, however, different networks may display exactly the same set of trees, an observation that poses a problem for network reconstruction: from the perspective of many inference methods such networks are "indistinguishable". This is true for all methods that evaluate a phylogenetic network solely on the basis of how well the displayed trees fit the available data, including all methods based on input data consisting of clades, triples, quartets, or trees with any number of taxa, and also sequence-based approaches such as popular formalisations of maximum parsimony and maximum likelihood for networks. This identifiability problem is partially solved by accounting for branch lengths, although this merely reduces the frequency of the problem. Here we propose that network inference methods should only attempt to reconstruct what they can uniquely identify. To this end, we introduce a novel definition of what constitutes a uniquely reconstructible network. For any given set of indistinguishable networks, we define a canonical network that, under mild assumptions, is unique and thus representative of the entire set. Given data that underwent reticulate evolution, only the canonical form of the underlying phylogenetic network can be uniquely reconstructed. While on the methodological side this will imply a drastic reduction of the solution space in network inference, for the study of reticulate evolution this is a fundamental limitation that will require an important change of perspective when interpreting phylogenetic networks.



Fig. 11 (pcbi.1004135.g011): Real-world example of indistinguishable network topologies. (Reproduced from [54], Fig. 4.) Three network topologies that display the two tree topologies in Fig. 3 of [54]. Note that these three networks are analogous to N1, N2 and N3 in Fig. 2 of the current paper: they each contain a reticulation cycle with three outgoing edges leading to the same three clades: {A. bicornis, A. longissima, A. sharonensis}, {A. uniaristata, A. comosa} and {A. tauschii} (in Fig. 2, instead of three clades, we have three taxa a, b and c).

Mentions: The existence of multiple optimal networks for a given data set is essentially due to two causes: insufficient data and non-identifiability. For the example of 486 optimal solutions, this large number may be partly due to the fact that the goal was to achieve consistency with only two tree topologies. More data may make it possible to discriminate among the 486 returned networks. Non-identifiability, which occurs when none of the allowed data can discriminate between two or more networks, is a more serious problem than insufficient data, as it cannot be solved by simply increasing the size of the input sample. Another interesting example appears in a paper by Albrecht et al. [54], which we reproduce here in Fig. 11. Here, there are only three optimal networks, essentially differing in which of the three clades {A. bicornis, A. longissima, A. sharonensis}, {A. uniaristata, A. comosa} and {A. tauschii} is considered a hybrid (in this example reticulations represent hybridizations). This pattern is entirely analogous to that of the three networks in Fig. 2 (with a, b and c replaced by the three clades above), meaning that these three networks are indistinguishable to methods not accounting for edge lengths. Therefore, in this example, the existence of multiple optimal solutions is entirely due to non-identifiability.
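
To make the notion of indistinguishability concrete, the following is a minimal Python sketch that enumerates the tree topologies displayed by a small rooted network, ignoring edge lengths. The three networks encoded here are an assumption: they are a hypothetical encoding of a reticulation cycle with three outgoing edges leading to a, b and c, differing only in which taxon sits below the reticulation, and they are not copied from Fig. 2 or from [54]. The function name `displayed_trees` and the node names `t`, `s1`, `s2`, `h` are likewise illustrative.

```python
from itertools import product

def displayed_trees(children, root):
    """Return the set of tree topologies displayed by a rooted network.

    `children` maps each internal node to its list of children; nodes that
    never appear as keys are leaves. A topology is a leaf label or a
    frozenset of subtopologies, so edge lengths are ignored.
    """
    # Collect the parents of every node to find reticulations (in-degree > 1).
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    retics = [v for v, ps in parents.items() if len(ps) > 1]

    trees = set()
    # Keep exactly one incoming edge per reticulation, in every possible way.
    for choice in product(*(parents[h] for h in retics)):
        kept = dict(zip(retics, choice))

        def topo(u):
            if not children.get(u):          # leaf of the network
                return u
            # Follow only the edges that survive the current choice.
            kids = [v for v in children[u] if kept.get(v, u) == u]
            subs = [s for s in map(topo, kids) if s is not None]
            if not subs:                     # dead end: prune it
                return None
            if len(subs) == 1:               # suppress out-degree-1 nodes
                return subs[0]
            return frozenset(subs)

        trees.add(topo(root))
    return trees

# Hypothetical networks: same cycle, different taxon below the reticulation h.
N_hybrid_a = {"t": ["s1", "h"], "s1": ["c", "s2"], "s2": ["b", "h"], "h": ["a"]}
N_hybrid_b = {"t": ["s1", "s2"], "s1": ["a", "h"], "s2": ["c", "h"], "h": ["b"]}
N_hybrid_c = {"t": ["s1", "h"], "s1": ["a", "s2"], "s2": ["b", "h"], "h": ["c"]}

sets = [displayed_trees(n, "t") for n in (N_hybrid_a, N_hybrid_b, N_hybrid_c)]
print(sets[0] == sets[1] == sets[2])  # True
```

Under these assumptions, all three networks display the same two topologies, ((a,b),c) and (a,(b,c)), so any method that scores a network only through its displayed tree topologies has no basis for preferring one over the others.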

