On the accuracy of language trees.

Pompei S, Loreto V, Tria F - PLoS ONE (2011)

Bottom Line: This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. Based on these scores we quantify the relative performances of the distance-based algorithms considered.


Affiliation: Complex Systems Lagrange Lab, Institute for Scientific Interchange, Torino, Italy.

ABSTRACT
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of an inverse problem: starting from present, incomplete and often noisy information, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further, we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally, we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.


Figure 4 (pone-0020109-g004). Non-binary nodes: biases of errors. The standard Robinson-Foulds (RF) distance and the Quartet Distance (QD) are biased when comparing binary trees with non-binary classifications. The only difference between the two trees in the figure is that the binary tree shows a more fine-grained classification. The two trees, however, are not conflicting, since the binary tree is simply a refinement of the non-binary classification. The RF distance counts every internal edge of this refinement (the blue edges in the figure) as an error, since these edges are not in the non-binary classification. The QD counts every quartet that includes the blue edges as an error, since all these quartets are stars in the non-binary classification. The generalized measures we introduce correctly treat the two trees in the example as non-conflicting.
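The bias described in the caption can be reproduced with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes trees are written as nested tuples of leaf names, and the three example trees (`classification`, `refined`, `conflicting`) and all function names are hypothetical. It computes the standard RF distance as the number of non-trivial bipartitions found in exactly one of the two trees.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaves(tree)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Keep only splits with at least two leaves on each side.
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))
        for child in node:
            walk(child)

    walk(tree)
    return found


def rf_distance(t1, t2):
    """Standard RF distance: splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


# Hypothetical example: a coarse expert classification with two groups,
# a binary tree that only refines it, and a binary tree that contradicts it.
classification = (("a", "b", "c"), ("d", "e", "f"))
refined = ((("a", "b"), "c"), (("d", "e"), "f"))
conflicting = ((("a", "d"), "c"), (("b", "e"), "f"))

print(rf_distance(classification, refined))      # 2
print(rf_distance(classification, conflicting))  # 4
```

The refined tree agrees with the classification on every group it contains, yet the standard RF distance still charges it two errors, one for each extra internal edge.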

Mentions: Figure 4 illustrates a situation in which a binary tree is compared with a non-binary one. Both the RF distance and the QD give a non-zero distance between the two trees: some partitions of the binary tree are in fact not present in the non-binary one. It is important to consider, however, that in the case at hand (algorithmic inference versus Ethnologue classification) a non-binary classification is simply due to a lack of the information or details that would lead to a finer classification. We would like to be able to distinguish intrinsic contradictions between reconstructed binary trees and the Ethnologue classifications from errors due to the low level of resolution of the Ethnologue trees. It is with this aim in mind that we introduce a generalization of both the RF distance and the QD.
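One way to separate genuine contradictions from mere lack of resolution is to penalize an inferred split only when it is incompatible with the reference classification. The sketch below illustrates that idea under stated assumptions; it is not the generalized RF distance defined in the paper, whose exact formula is not given in this excerpt. The helper names (`split`, `compatible`, `refinement_aware_rf`) and the example data are hypothetical. Two bipartitions A|B and C|D are compatible when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty.

```python
def split(side, taxa):
    """Build a bipartition of `taxa` from the leaves on one side."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])


def compatible(s1, s2):
    """Two splits can coexist in one tree if some pair of sides is disjoint."""
    (a, b), (c, d) = tuple(s1), tuple(s2)
    return any(not (x & y) for x in (a, b) for y in (c, d))


def refinement_aware_rf(inferred, reference):
    """Count reference splits missing from the inferred tree, plus inferred
    splits that contradict the reference. Extra inferred splits that merely
    refine the reference are not penalized (assumed scoring, for illustration)."""
    missing = reference - inferred
    conflicting = {
        s for s in inferred
        if any(not compatible(s, r) for r in reference)
    }
    return len(missing) + len(conflicting)


# Hypothetical data: six languages, one coarse reference grouping.
taxa = "abcdef"
reference = {split("abc", taxa)}
refined = {split("ab", taxa), split("abc", taxa), split("de", taxa)}
contradicting = {split("ad", taxa), split("acd", taxa), split("be", taxa)}

print(refinement_aware_rf(refined, reference))        # 0
print(refinement_aware_rf(contradicting, reference))  # 4
```

Under this scoring a pure refinement of the reference gets a distance of zero, while a tree that breaks up the reference groups is still penalized.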

