Limits...
On the accuracy of language trees.

Pompei S, Loreto V, Tria F - PLoS ONE (2011)

Bottom Line: This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them.A possible way out is to compare algorithmic inference with expert classifications.Based on these scores we quantify the relative performances of the distance-based algorithms considered.

View Article: PubMed Central - PubMed

Affiliation: Complex Systems Lagrange Lab, Institute for Scientific Interchange, Torino, Italy.

ABSTRACT
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information are typically lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of inverse problems: starting from present, incomplete and often noisy, information, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.

Show MeSH

Related in: MedlinePlus

Worldwide accuracy of the inferred language trees.This map represents the level of accuracy of the FastSBiX algorithm on several language families throughout the world. The colors code the values of the Generalized Quartet Distance (GQD) between the trees inferred with the FastSBiX algorithm and the LDND definition of distance for each language family included in the ASJP database and the corresponding Ethnologue classifications. The GQD is normalized with the corresponding random value (see text for details). On the one hand blue regions corresponds to language families for which the inferred trees strongly agree with the Ethnologue classification. On the other hand red regions corresponds to poorly reconstructed language families. Yellow is for the families in which a random reconstruction would get a GQD score of zero, meaning that the Ethnologue classification has a  resolution (the corresponding tree is a star). Grey areas are those for which no data are present in the databases adopted for the reconstruction. Asterisks are for regions which include more than one family of languages. See File S1 for the analogous maps obtained with different algorithms and different definitions of the distance between languages.
© Copyright Policy
Related In: Results  -  Collection


getmorefigures.php?uid=PMC3108590&req=5

pone-0020109-g007: Worldwide accuracy of the inferred language trees.This map represents the level of accuracy of the FastSBiX algorithm on several language families throughout the world. The colors code the values of the Generalized Quartet Distance (GQD) between the trees inferred with the FastSBiX algorithm and the LDND definition of distance for each language family included in the ASJP database and the corresponding Ethnologue classifications. The GQD is normalized with the corresponding random value (see text for details). On the one hand blue regions corresponds to language families for which the inferred trees strongly agree with the Ethnologue classification. On the other hand red regions corresponds to poorly reconstructed language families. Yellow is for the families in which a random reconstruction would get a GQD score of zero, meaning that the Ethnologue classification has a resolution (the corresponding tree is a star). Grey areas are those for which no data are present in the databases adopted for the reconstruction. Asterisks are for regions which include more than one family of languages. See File S1 for the analogous maps obtained with different algorithms and different definitions of the distance between languages.

Mentions: We finally give a pictorial view of the accuracy of the reconstruction algorithm across the planet. Figure 7 illustrates the Generalized Quartet Distance for the different language families on the world map, normalized with the corresponding random value. More specifically, the color codes, for each family , the following quantity:(9)where represents the mean value of the GQD obtained averaging over randomly reconstructed trees with the same leaves (languages) of the family . quantifies the level of accuracy of the reconstruction with respect to a model. The multiplicative factor is included for the sake of better visualization: indicates a equal or higher to half of the random tree distance .


On the accuracy of language trees.

Pompei S, Loreto V, Tria F - PLoS ONE (2011)

Worldwide accuracy of the inferred language trees.This map represents the level of accuracy of the FastSBiX algorithm on several language families throughout the world. The colors code the values of the Generalized Quartet Distance (GQD) between the trees inferred with the FastSBiX algorithm and the LDND definition of distance for each language family included in the ASJP database and the corresponding Ethnologue classifications. The GQD is normalized with the corresponding random value (see text for details). On the one hand blue regions corresponds to language families for which the inferred trees strongly agree with the Ethnologue classification. On the other hand red regions corresponds to poorly reconstructed language families. Yellow is for the families in which a random reconstruction would get a GQD score of zero, meaning that the Ethnologue classification has a  resolution (the corresponding tree is a star). Grey areas are those for which no data are present in the databases adopted for the reconstruction. Asterisks are for regions which include more than one family of languages. See File S1 for the analogous maps obtained with different algorithms and different definitions of the distance between languages.
© Copyright Policy
Related In: Results  -  Collection

Show All Figures
getmorefigures.php?uid=PMC3108590&req=5

pone-0020109-g007: Worldwide accuracy of the inferred language trees.This map represents the level of accuracy of the FastSBiX algorithm on several language families throughout the world. The colors code the values of the Generalized Quartet Distance (GQD) between the trees inferred with the FastSBiX algorithm and the LDND definition of distance for each language family included in the ASJP database and the corresponding Ethnologue classifications. The GQD is normalized with the corresponding random value (see text for details). On the one hand blue regions corresponds to language families for which the inferred trees strongly agree with the Ethnologue classification. On the other hand red regions corresponds to poorly reconstructed language families. Yellow is for the families in which a random reconstruction would get a GQD score of zero, meaning that the Ethnologue classification has a resolution (the corresponding tree is a star). Grey areas are those for which no data are present in the databases adopted for the reconstruction. Asterisks are for regions which include more than one family of languages. See File S1 for the analogous maps obtained with different algorithms and different definitions of the distance between languages.
Mentions: We finally give a pictorial view of the accuracy of the reconstruction algorithm across the planet. Figure 7 illustrates the Generalized Quartet Distance for the different language families on the world map, normalized with the corresponding random value. More specifically, the color codes, for each family , the following quantity:(9)where represents the mean value of the GQD obtained averaging over randomly reconstructed trees with the same leaves (languages) of the family . quantifies the level of accuracy of the reconstruction with respect to a model. The multiplicative factor is included for the sake of better visualization: indicates a equal or higher to half of the random tree distance .

Bottom Line: This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them.A possible way out is to compare algorithmic inference with expert classifications.Based on these scores we quantify the relative performances of the distance-based algorithms considered.

View Article: PubMed Central - PubMed

Affiliation: Complex Systems Lagrange Lab, Institute for Scientific Interchange, Torino, Italy.

ABSTRACT
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information are typically lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of inverse problems: starting from present, incomplete and often noisy, information, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.

Show MeSH
Related in: MedlinePlus