On the accuracy of language trees.

Pompei S, Loreto V, Tria F - PLoS ONE (2011)

Bottom Line: This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. Based on these scores we quantify the relative performances of the distance-based algorithms considered.


Affiliation: Complex Systems Lagrange Lab, Institute for Scientific Interchange, Torino, Italy.

ABSTRACT
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of an inverse problem: starting from present, incomplete and often noisy information, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further, we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally, we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.

Figure 3 (pone-0020109-g003). Robinson-Foulds and Quartet Distance: errors due to the displacement of a pair of subtrees. The two trees differ by a swap of the subtrees A and B. When computing the distance between them, the Robinson-Foulds distance counts all the edges in the path between the swapped subtrees as errors, regardless of the size of the subtrees attached to them. The number of wrong butterfly quartets counted as errors by the Quartet Distance, by contrast, depends on the size of the subtrees.



Mentions: A detailed analysis of both RF and QD is reported in [32], [33], pointing out the different information the two measures convey. Limiting cases are also shown there: pairs of trees that have the same RF distance but very different QD, and vice versa. In Fig. 3, following an illuminating example from [32], [33], we show how the RF and the QD measures weigh the swap of two subtrees in a tree. In this case the RF distance is equal to the number of edges in the path between the swapped subtrees, while the QD is sensitive to the size of the subtrees. The RF is therefore a good measure if we are interested in how far subtrees are moved in one tree with respect to another. When we are interested instead in the size of the displaced subtrees, the quartet distance is the more adequate measure.
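The contrast between the two measures can be illustrated with a minimal Python sketch. It is not the authors' code and does not reproduce the generalized distances introduced in the paper: it assumes unrooted trees written as nested tuples of leaf names, uses the plain symmetric-difference form of the RF distance and a brute-force quartet count, and all function names (caterpillar, splits, rf_distance, quartet_distance) are hypothetical helpers chosen for illustration. The two subtrees at the ends of a caterpillar-shaped tree are swapped, first as single leaves and then as larger subtrees.

```python
from itertools import combinations

def leaves(tree):
    """Leaf set of a tree given as nested tuples of leaf-name strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of the tree, read as an unrooted tree."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)  # fixed leaf used to pick one canonical side per split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        if 1 < len(clade) < len(all_leaves) - 1:
            found.add(clade if ref not in clade else all_leaves - clade)
        for child in node:
            visit(child)

    visit(tree)
    return found

def rf_distance(t1, t2):
    """Robinson-Foulds distance: splits present in exactly one of the trees."""
    return len(splits(t1) ^ splits(t2))

def quartet_topology(split_set, quartet):
    """Pairing xy|zw induced on four leaves, or None if unresolved."""
    a, b, c, d = quartet
    for pair, rest in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        for side in split_set:
            inside = [x in side for x in (*pair, *rest)]
            if inside in ([True, True, False, False], [False, False, True, True]):
                return frozenset([frozenset(pair), frozenset(rest)])
    return None

def quartet_distance(t1, t2):
    """Number of leaf quartets whose induced topology differs (brute force)."""
    s1, s2 = splits(t1), splits(t2)
    return sum(
        quartet_topology(s1, q) != quartet_topology(s2, q)
        for q in combinations(sorted(leaves(t1)), 4)
    )

def caterpillar(first, middle, last):
    """Path-like tree with `first` and `last` attached at the two ends."""
    node = (first, middle[0])
    for leaf in middle[1:]:
        node = (node, leaf)
    return (node, last)

middle = ["p", "q", "r"]

# Swap of two single leaves.
small_a, small_b = "a1", "b1"
rf_small = rf_distance(caterpillar(small_a, middle, small_b),
                       caterpillar(small_b, middle, small_a))
qd_small = quartet_distance(caterpillar(small_a, middle, small_b),
                            caterpillar(small_b, middle, small_a))

# Same swap, but with larger subtrees at the two ends.
big_a, big_b = ("a1", "a2"), ("b1", ("b2", "b3"))
rf_big = rf_distance(caterpillar(big_a, middle, big_b),
                     caterpillar(big_b, middle, big_a))
qd_big = quartet_distance(caterpillar(big_a, middle, big_b),
                          caterpillar(big_b, middle, big_a))

print("RF:", rf_small, rf_big)
print("QD:", qd_small, qd_big)
```

In this toy setting the RF value is the same for both swaps, since only the splits along the connecting path change, while the quartet count grows with the size of the swapped subtrees. The brute-force quartet enumeration is only practical for small trees.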

