Limits...
Tree edit distance for leaf-labelled trees on free leafset and its comparison with frequent subsplit dissimilarity and popular distance measures.

Koperwas J, Walczak K - BMC Bioinformatics (2011)

Bottom Line: Two of the presented methods carry the most interesting properties.E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees.NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.

View Article: PubMed Central - HTML - PubMed

Affiliation: Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland. j.koperwas@elka.pw.edu.pl

ABSTRACT

Background: This paper is devoted to distance measures for leaf-labelled trees on free leafset. A leaf-labelled tree is a data structure which is a special type of a tree where only leaves (terminal) nodes are labelled. This data structure is used in bioinformatics for modelling of evolution history of genes and species and also in linguistics for modelling of languages evolution history. Many domain specific problems occur and need to be solved with help of tree postprocessing techniques such as distance measures.

Results: Here we introduce the tree edit distance designed for leaf labelled trees on free leafset, which occurs to be a metric. It is presented together with tree edit consensus tree notion. We provide statistical evaluation of provided measure with respect to R-F, MAST and frequent subsplit based dissimilarity measures as the reference measures.

Conclusions: The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, therefore their properties can be tuned depending on the needs of the user. Two of the presented methods carry the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees. NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.

Show MeSH

Related in: MedlinePlus

Two trees on the same leafset where the Edit distance is more suitable than others.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
getmorefigures.php?uid=PMC3117735&req=5

Figure 15: Two trees on the same leafset where the Edit distance is more suitable than others.

Mentions: These results show that, in some situations, pruning operations are better at unifying trees, sometimes contractions are, and sometimes neither performs well. However, there are cases when using both of them is better. To sum up below, there are some cases when one editing operation is better than another: Pruning is better: when the trees are not on the same leafset, then pruning is necessary (figures 2, 11, 14) the trees may be on the same leafset but they contain some noisy leaf (leaf d in trees T1 and T2 from Figure 12). Contraction is better: when two significantly large sub-trees are connected directly on one tree, but connected with an additional edge on the second tree (Figure 13). Both operations seem to be equally good when the trees are on the same leafset, there is no noisy data, and the degrees of the nodes are relatively small. Because there are some cases when one operation is better than the other, a distance based on both operations shall be better than one based on only one operation. A distance constructed in this way will choose the most appropriate editing operations. For example, consider Figure 15, the c-distance equals 8 (all nontrivial splits), p-distance (d and b in both trees plus 3 forced contractions) equals 7, pc-distance (i.e. edit distance with unitary costs) equals 4 (d in both trees plus 2 splits)


Tree edit distance for leaf-labelled trees on free leafset and its comparison with frequent subsplit dissimilarity and popular distance measures.

Koperwas J, Walczak K - BMC Bioinformatics (2011)

Two trees on the same leafset where the Edit distance is more suitable than others.
© Copyright Policy - open-access
Related In: Results  -  Collection

License
Show All Figures
getmorefigures.php?uid=PMC3117735&req=5

Figure 15: Two trees on the same leafset where the Edit distance is more suitable than others.
Mentions: These results show that, in some situations, pruning operations are better at unifying trees, sometimes contractions are, and sometimes neither performs well. However, there are cases when using both of them is better. To sum up below, there are some cases when one editing operation is better than another: Pruning is better: when the trees are not on the same leafset, then pruning is necessary (figures 2, 11, 14) the trees may be on the same leafset but they contain some noisy leaf (leaf d in trees T1 and T2 from Figure 12). Contraction is better: when two significantly large sub-trees are connected directly on one tree, but connected with an additional edge on the second tree (Figure 13). Both operations seem to be equally good when the trees are on the same leafset, there is no noisy data, and the degrees of the nodes are relatively small. Because there are some cases when one operation is better than the other, a distance based on both operations shall be better than one based on only one operation. A distance constructed in this way will choose the most appropriate editing operations. For example, consider Figure 15, the c-distance equals 8 (all nontrivial splits), p-distance (d and b in both trees plus 3 forced contractions) equals 7, pc-distance (i.e. edit distance with unitary costs) equals 4 (d in both trees plus 2 splits)

Bottom Line: Two of the presented methods carry the most interesting properties.E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees.NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.

View Article: PubMed Central - HTML - PubMed

Affiliation: Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland. j.koperwas@elka.pw.edu.pl

ABSTRACT

Background: This paper is devoted to distance measures for leaf-labelled trees on free leafset. A leaf-labelled tree is a data structure which is a special type of a tree where only leaves (terminal) nodes are labelled. This data structure is used in bioinformatics for modelling of evolution history of genes and species and also in linguistics for modelling of languages evolution history. Many domain specific problems occur and need to be solved with help of tree postprocessing techniques such as distance measures.

Results: Here we introduce the tree edit distance designed for leaf labelled trees on free leafset, which occurs to be a metric. It is presented together with tree edit consensus tree notion. We provide statistical evaluation of provided measure with respect to R-F, MAST and frequent subsplit based dissimilarity measures as the reference measures.

Conclusions: The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, therefore their properties can be tuned depending on the needs of the user. Two of the presented methods carry the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar to a normal distribution in its shape and is good both for similar and non-similar trees. NFC(2,1) on the other hand is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.

Show MeSH
Related in: MedlinePlus