Tree edit distance for leaf-labelled trees on free leafset and its comparison with frequent subsplit dissimilarity and popular distance measures.
Bottom Line:
Two of the presented methods carry the most interesting properties.E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees.NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.
Affiliation: Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland. j.koperwas@elka.pw.edu.pl
ABSTRACT
Show MeSH
Background: This paper is devoted to distance measures for leaf-labelled trees on free leafset. A leaf-labelled tree is a data structure which is a special type of a tree where only leaves (terminal) nodes are labelled. This data structure is used in bioinformatics for modelling of evolution history of genes and species and also in linguistics for modelling of languages evolution history. Many domain specific problems occur and need to be solved with help of tree postprocessing techniques such as distance measures. Results: Here we introduce the tree edit distance designed for leaf labelled trees on free leafset, which occurs to be a metric. It is presented together with tree edit consensus tree notion. We provide statistical evaluation of provided measure with respect to R-F, MAST and frequent subsplit based dissimilarity measures as the reference measures. Conclusions: The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, therefore their properties can be tuned depending on the needs of the user. Two of the presented methods carry the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees. NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type. Related in: MedlinePlus |
Related In:
Results -
Collection
License getmorefigures.php?uid=PMC3117735&req=5
Mentions: For a pruning mutation, the situation looks very similar (see Figure 29) i.e. all distances would have identical values if scaled, however a few things should be pointed out. The R-F distance here gets smaller with increasing number of p operations. This is because it does not work for trees with different leafsets, in such a case it returns the maximum possible value, which is lower for a smaller leafset, and this is exactly what is illustrated in the Figure. Another point is that while the R-F was the distance best scaled for contractions, MAST is the distance best scaled for pruning. This is natural because R-F uses contraction while MAST uses pruning. So if we consider not just pruning nor just contraction, but wish to use both, then the distances do not have the same dynamics. What is seen here is that NFC(2,1) scales best for both contraction and pruning. This can be visualised better when we see the reaction of distances on contraction and pruning on the same chart. In Figure 30, we can see the reaction of R-F, MAST and NFC(2,1) with respect to contraction, pruning, and Nearest Non-Brother Interchange (i.e an operation that is neither contraction nor pruning but the distance can be realised by both of these operations). We can see that only the NFC dynamics are similar irrespective of the type of mutation used. |
View Article: PubMed Central - HTML - PubMed
Affiliation: Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland. j.koperwas@elka.pw.edu.pl
Background: This paper is devoted to distance measures for leaf-labelled trees on free leafset. A leaf-labelled tree is a data structure which is a special type of a tree where only leaves (terminal) nodes are labelled. This data structure is used in bioinformatics for modelling of evolution history of genes and species and also in linguistics for modelling of languages evolution history. Many domain specific problems occur and need to be solved with help of tree postprocessing techniques such as distance measures.
Results: Here we introduce the tree edit distance designed for leaf labelled trees on free leafset, which occurs to be a metric. It is presented together with tree edit consensus tree notion. We provide statistical evaluation of provided measure with respect to R-F, MAST and frequent subsplit based dissimilarity measures as the reference measures.
Conclusions: The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, therefore their properties can be tuned depending on the needs of the user. Two of the presented methods carry the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees. NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.