Universal entropy of word ordering across linguistic families.

Montemurro MA, Zanette DH - PLoS ONE (2011)

Bottom Line: We computed a relative entropy measure to quantify the degree of ordering in word sequences from languages belonging to several linguistic families. While a direct estimation of the overall entropy of language yielded values that varied for the different families considered, the relative entropy quantifying word ordering presented an almost constant value for all those families. Our results indicate that despite the differences in the structure and vocabulary of the languages analyzed, the impact of word ordering in the structure of language is a statistical linguistic universal.


Affiliation: The University of Manchester, Manchester, United Kingdom. M.Montemurro@manchester.ac.uk

ABSTRACT

Background: The language faculty is probably the most distinctive feature of our species, and endows us with a unique ability to exchange highly structured information. In written language, information is encoded by the concatenation of basic symbols under grammatical and semantic constraints. As is also the case in other natural information carriers, the resulting symbolic sequences show a delicate balance between order and disorder. That balance is determined by the interplay between the diversity of symbols and their specific ordering in the sequences. Here we used entropy to quantify the contribution of different organizational levels to the overall statistical structure of language.

Methodology/principal findings: We computed a relative entropy measure to quantify the degree of ordering in word sequences from languages belonging to several linguistic families. While a direct estimation of the overall entropy of language yielded values that varied for the different families considered, the relative entropy quantifying word ordering presented an almost constant value for all those families.
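One way to make the measure concrete (a hedged reading of the methodology; the paper's precise estimator and definition may differ in detail) is to take the relative entropy of word ordering as the entropy destroyed by shuffling: the shuffled text preserves word frequencies but removes all ordering, so the gap between the two entropies isolates the contribution of word order.

```latex
% Sketch under the assumption that the ordering measure is the
% entropy difference between a word-shuffled text and the original:
% H_{\text{shuffled}} reflects word frequencies alone, while
% H_{\text{original}} also reflects order-induced correlations.
D_s \;=\; H_{\text{shuffled}} \;-\; H_{\text{original}} \;\ge\; 0
```

Under this reading, the finding is that while the individual entropies vary across linguistic families, the difference D_s stays almost constant.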

Conclusions/significance: Our results indicate that despite the differences in the structure and vocabulary of the languages analyzed, the impact of word ordering in the structure of language is a statistical linguistic universal.

pone-0019875-g001: Entropy distributions for corpora belonging to three languages. Each panel shows the distribution of the entropy of the random texts lacking linguistic structure (blue), that of the original texts (green), and that of the relative entropy (red). The three languages (Chinese, English, and Finnish) were chosen because they had the largest corpora in three different linguistic families. In panels A, B, and C, the random texts were obtained by randomly shuffling the words in the original ones. In panels D, E, and F, the random texts were generated using the word frequencies in the original texts.

Mentions: In Figure 1 we show the distribution of the entropy of individual texts obtained for three languages belonging to different linguistic families. In each of the upper panels, the rightmost distribution corresponds to the entropy of shuffled texts. The central distribution in each panel corresponds to the entropy of the original texts. This entropy contains contributions both from the words' frequencies regardless of their order and from the correlations emerging from word order. Note that the displacement between the two distributions is only a consequence of word ordering. Finally, the leftmost distribution in the upper panels of Figure 1 corresponds to the relative entropy Ds between the original and shuffled texts in each language.
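The comparison described above can be sketched in code. This is a minimal illustration, not the paper's actual method: the function name `entropy_rate_lz`, the toy corpus, and the choice of a simple Kontoyiannis-style match-length estimator are all assumptions made here for clarity (the authors' estimator is more elaborate). The sketch estimates the entropy rate of an ordered word sequence and of a shuffled copy, and takes their difference as the ordering contribution.

```python
import math
import random

def entropy_rate_lz(words):
    """Estimate the entropy rate in bits per word with a match-length
    (Lempel-Ziv / Kontoyiannis-style) estimator. Lambda_i is the length
    of the shortest substring starting at position i that does not occur
    inside the prefix; the sample mean of Lambda_i / log2(i + 1)
    converges to 1/H for stationary ergodic sources (one common variant
    requires the match to lie entirely within the prefix, as here)."""
    # Map each distinct word to one character so fast substring search works.
    codes = {}
    s = ''.join(codes.setdefault(w, chr(len(codes))) for w in words)
    n = len(s)
    acc = 0.0
    for i in range(1, n):
        l = 0
        # Grow the match while s[i:i+l+1] occurs entirely within s[:i];
        # str.find with end bound i restricts the search to the prefix.
        while i + l < n and s.find(s[i:i + l + 1], 0, i) != -1:
            l += 1
        acc += (l + 1) / math.log2(i + 1)
    return (n - 1) / acc

# Toy corpus: a strongly ordered text vs. its word-shuffled counterpart.
random.seed(0)
ordered = ["the", "cat", "sat", "on", "the", "mat"] * 50
shuffled = ordered[:]
random.shuffle(shuffled)

h_orig = entropy_rate_lz(ordered)   # low: word order makes it predictable
h_shuf = entropy_rate_lz(shuffled)  # higher: only frequencies remain
d_s = h_shuf - h_orig               # entropy destroyed by removing order
print(round(h_orig, 3), round(h_shuf, 3), round(d_s, 3))
```

Because shuffling preserves word frequencies exactly, the difference `d_s` isolates the contribution of ordering, mirroring the displacement between the green and blue distributions in Figure 1.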

