On the accuracy of language trees.

Pompei S, Loreto V, Tria F - PLoS ONE (2011)

Bottom Line: This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. Based on these scores we quantify the relative performances of the distance-based algorithms considered.


Affiliation: Complex Systems Lagrange Lab, Institute for Scientific Interchange, Torino, Italy.

ABSTRACT
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of an inverse problem: starting from present information, which is incomplete and often noisy, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further, we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally, we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.
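
As a point of reference for the "standard definitions of distances between trees" mentioned in the abstract, the following is a minimal sketch of the standard (rooted, clade-based) Robinson-Foulds distance. It is not the generalized score introduced in the paper, whose definition is not given on this page. The nested-tuple tree encoding, the function names and the toy language codes are assumptions made for illustration only.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a leaf is a language name, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the leaf sets of all internal nodes, excluding the root."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        if len(members) > 1:  # single-leaf groups carry no grouping information
            found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is common to any two trees on the same leaves
    return found


def robinson_foulds(a: Tree, b: Tree) -> int:
    """Standard rooted Robinson-Foulds distance: number of clades found in only one tree."""
    return len(clades(a) ^ clades(b))


# Toy example (made-up trees): a fully resolved inferred tree against a
# reference classification that leaves one group unresolved.
inferred = ((("it", "es"), "fr"), ("de", "en"))
reference = (("it", "es", "fr"), ("de", "en"))
print(robinson_foulds(inferred, reference))
```

In this toy case the only disagreement is the clade {it, es}, which the inferred tree resolves and the reference leaves inside a three-way split. The standard count treats that extra resolution as an error, which is a known weakness when the reference is a non-binary expert classification.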


pone-0020109-g008: Role of the word-list completeness and coverage. (Left) The Generalized Robinson-Foulds (GRF) score between the inferred trees and the corresponding Ethnologue classification for the Austronesian family, vs. the number of most shared words, for both the ASJP and the ABVD databases. The inset reports the behaviour of the effective number of most shared words, defined as follows: for each list, it is the sum of the values of […] over all the meanings in the list. In this way it quantifies the effective number of most shared meanings. There is a strong correlation between […] and […] for […]. For […], the effective number no longer increases in the ASJP database. This explains why the GRF does not decrease for […] for the ASJP database. (Right) The Generalized Quartet Distance (GQD) between the inferred trees and the corresponding Ethnologue classification for the Austronesian family, vs. the number of most shared words, for both the ASJP and the ABVD databases. The inset reports the behaviour of the Coverage, which measures the degree of alignment of the word-lists for the different languages considered, vs. […] (see text for details about the definition of Coverage). Again there is a strong correlation between the Coverage and […]. The distance-based algorithm used is FastSBiX with the LDN definition of distance.

Mentions: We compute the dissimilarity matrices using only the reduced lists constructed as above, and we use those matrices as the starting point for the reconstruction algorithm (we use the FastSBiX algorithm for all the results discussed below). Fig. 8 reports the Generalized Robinson-Foulds score (left) and the Generalized Quartet Distance (right) between the inferred trees and the corresponding Ethnologue classifications, as a function of the number of chosen words, for both the ASJP and the ABVD databases. As a general trend, the number of errors decreases as the size of the word-lists considered increases. Although the largest improvement in accuracy occurs when adding the first […] or […] words, a slow improvement persists as the word-list size keeps increasing. This already suggests that, in order to improve the accuracy of the phylogenetic reconstruction, one has to increase the size of the word-lists. The accuracies obtained with the ABVD and ASJP databases are very similar when considering the first […] most shared words. Upon increasing this number, ASJP does not show any improvement, while ABVD keeps improving its accuracy, although very slowly, when […]. A possible explanation is the presence, in the ASJP database, of meanings with a very low level of sharing (see inset of the left panel of Fig. 8 as well as Fig. 2).
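
The pipeline described above (keep the most shared meanings, then build a dissimilarity matrix from the reduced lists) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: LDN is taken here to be the Levenshtein distance divided by the length of the longer word, the distance between two languages is the mean LDN over the meanings both lists contain, and pairs with no meaning in common are assigned the maximal value 1.0. All function names and the toy word-lists are hypothetical.

```python
from itertools import combinations
from typing import Dict, List

WordLists = Dict[str, Dict[str, str]]  # language -> meaning -> word


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """Assumed LDN: Levenshtein distance normalized by the longer word's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def most_shared_meanings(lists: WordLists, n: int) -> List[str]:
    """Rank meanings by how many languages have a word for them and keep the top n."""
    counts: Dict[str, int] = {}
    for words in lists.values():
        for meaning in words:
            counts[meaning] = counts.get(meaning, 0) + 1
    return sorted(counts, key=lambda m: (-counts[m], m))[:n]


def dissimilarity_matrix(lists: WordLists, meanings: List[str]) -> Dict[str, Dict[str, float]]:
    """Mean LDN over the selected meanings that both languages have."""
    langs = sorted(lists)
    matrix = {x: {y: 0.0 for y in langs} for x in langs}
    for x, y in combinations(langs, 2):
        common = [m for m in meanings if m in lists[x] and m in lists[y]]
        # Assumption: languages with no comparable meaning are maximally distant.
        dist = sum(ldn(lists[x][m], lists[y][m]) for m in common) / len(common) if common else 1.0
        matrix[x][y] = matrix[y][x] = dist
    return matrix


# Toy word-lists (made up for illustration, not taken from ASJP or ABVD).
toy = {
    "A": {"hand": "mano", "water": "agua", "fire": "fuego"},
    "B": {"hand": "mano", "water": "acqua"},
    "C": {"hand": "main", "water": "eau", "stone": "pierre"},
}
reduced = most_shared_meanings(toy, 2)
matrix = dissimilarity_matrix(toy, reduced)
```

The resulting matrix is what a distance-based reconstruction algorithm would take as input; the reconstruction step itself (FastSBiX in the text) is not sketched here.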

