On the accuracy of language trees.

Pompei S, Loreto V, Tria F - PLoS ONE (2011)

Bottom Line: This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. Based on these scores we quantify the relative performances of the distance-based algorithms considered.

Affiliation: Complex Systems Lagrange Lab, Institute for Scientific Interchange, Torino, Italy.

ABSTRACT
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics. From this perspective the reconstruction of language trees is an example of an inverse problem: starting from present, incomplete and often noisy information, one aims at inferring the most likely past evolutionary history. A fundamental issue in inverse problems is the evaluation of the inference made. A standard way of dealing with this question is to generate data with artificial models in order to have full access to the evolutionary process one is going to infer. This procedure presents an intrinsic limitation: when dealing with real data sets, one typically does not know which model of evolution is the most suitable for them. A possible way out is to compare algorithmic inference with expert classifications. This is the point of view we take here by conducting a thorough survey of the accuracy of reconstruction methods as compared with the Ethnologue expert classifications. We focus in particular on state-of-the-art distance-based methods for phylogeny reconstruction using worldwide linguistic databases. In order to assess the accuracy of the inferred trees we introduce and characterize two generalizations of standard definitions of distances between trees. Based on these scores we quantify the relative performances of the distance-based algorithms considered. Further, we quantify how the completeness and the coverage of the available databases affect the accuracy of the reconstruction. Finally, we draw some conclusions about where the accuracy of the reconstructions in historical linguistics stands and about the leading directions to improve it.

Figure 2. Top: Statistics of the ASJP database. (Left panel) Fraction-rank plot: for each word in the word lists of the Automated Similarity Judgement Program (ASJP), we measured the fraction of languages containing it. The plot reports this fraction vs. its rank. In the -items lists in the ASJP database, only  meanings are shared by almost  of the languages for each family. (Right panel) Ranked fraction of pairs of languages sharing each specific word vs. rank. Again only  meanings are shared by almost  of the pairs of languages. Bottom: Statistical measures on the ABVD database. (Left panel) Fraction-rank plot: for each word in the word lists of the Austronesian Basic Vocabulary Database (ABVD), we measured the fraction of languages containing it. The plot reports this fraction vs. its rank. (Right panel) Ranked fraction of pairs of languages sharing each specific word vs. rank. For the sake of a rough comparison we also report the same quantities measured on the Austronesian family of the ASJP database. The ASJP includes  words up to a maximum of almost  of the languages, whereas in the ABVD the percentage of coverage is at least  for almost all the words in the list. Limited to the  most shared words, the ASJP database features a slightly larger coverage than the ABVD database.
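
The two completeness measures in this caption can be illustrated with a minimal sketch. It assumes each language is stored as a mapping from meaning to word, with missing entries simply absent; the function names and this data layout are hypothetical and are not taken from the ASJP or ABVD distributions.

```python
def coverage_by_meaning(word_lists):
    """Return two dicts keyed by meaning.

    word_lists: dict mapping language -> dict mapping meaning -> word
    (hypothetical layout; a meaning missing for a language is absent).
    """
    languages = list(word_lists)
    meanings = set().union(*(wl.keys() for wl in word_lists.values()))
    n_langs = len(languages)
    n_pairs = n_langs * (n_langs - 1) // 2

    lang_frac = {}  # fraction of languages that have a word for the meaning
    pair_frac = {}  # fraction of language pairs in which both have a word
    for meaning in meanings:
        have = sum(1 for lang in languages if meaning in word_lists[lang])
        lang_frac[meaning] = have / n_langs
        pair_frac[meaning] = (have * (have - 1) // 2) / n_pairs if n_pairs else 0.0
    return lang_frac, pair_frac


def ranked(fractions):
    """Sort fractions in decreasing order; the list index is the rank."""
    return sorted(fractions.values(), reverse=True)
```

Plotting `ranked(lang_frac)` and `ranked(pair_frac)` against their index would give curves of the kind described for the left and right panels.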

Mentions: The Automated Similarity Judgement Program (ASJP) [27] includes -items word lists of about 50 families of languages throughout the world. These lists are written in a standardized orthography (ASJP code) which employs only symbols of the standard QWERTY keyboard, defining vowels, consonants and phonological features. The full database is available at http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm. Figure 2 (top) reports two statistical measures on the database to quantify its completeness. In particular we report the ranked fraction of languages containing a word for a specific meaning vs. the rank (left panel) and the ranked fraction of pairs of languages sharing a word (not necessarily a cognate) for a specific meaning vs. the rank (right panel). The second measure helps in understanding how accurate it is, from a statistical point of view, to compute the distance between two languages by averaging the Levenshtein distances of all the words for homologous meanings. The extreme completeness of the database is evident for lists up to  meanings.
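
As a rough illustration of the averaging described above, the following sketch computes the plain Levenshtein distance between two words and averages it over the meanings that both languages share. Whether and how the distances are normalized is not stated in this passage, so none is applied here; the function names and the toy word lists are hypothetical and are not ASJP data.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def language_distance(list_a, list_b):
    """Mean Levenshtein distance over meanings present in both lists.

    list_a, list_b: dict mapping meaning -> word (hypothetical layout).
    Returns None when the two languages share no meaning.
    """
    shared = list_a.keys() & list_b.keys()
    if not shared:
        return None
    total = sum(levenshtein(list_a[m], list_b[m]) for m in shared)
    return total / len(shared)


# Toy word lists, invented for illustration only.
lang_a = {"I": "ik", "you": "du"}
lang_b = {"I": "ich", "you": "du", "dog": "hund"}
print(language_distance(lang_a, lang_b))
```

Only shared meanings enter the average, which is why the pair-coverage statistic matters: the fewer meanings a pair shares, the fewer word comparisons support its distance.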

