Zipf's Law for Word Frequencies: Word Forms versus Lemmas in Long Texts.

Corral Á, Boleda G, Ferrer-i-Cancho R - PLoS ONE (2015)

Bottom Line: Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies.

Affiliation: Centre de Recerca Matemàtica, Bellaterra, Barcelona, Spain; Departament de Matemàtiques, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain.

ABSTRACT
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
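The word-lemma transformation described in the abstract can be illustrated with a minimal sketch. The lemma map and the sample sentence below are invented for illustration and are not taken from the paper; the authors analyzed long literary texts, which would need a real lemmatizer. The point is only that merging inflected forms under one lemma reduces the number of distinct types and raises some counts, which is why the transformation affects all ranges of frequencies.

```python
from collections import Counter

# Hypothetical, hand-written lemma map (illustration only, not the authors' tool).
LEMMA_OF = {
    "is": "be", "are": "be", "was": "be",
    "houses": "house",
}

def word_and_lemma_counts(tokens, lemma_of):
    """Return (word-form frequencies, lemma frequencies) for a token list."""
    word_counts = Counter(tokens)
    # Tokens missing from the map are treated as their own lemma.
    lemma_counts = Counter(lemma_of.get(t, t) for t in tokens)
    return word_counts, lemma_counts

# Toy sentence, invented for this example.
tokens = "the house is old and the houses are old but the barn was new".split()
words, lemmas = word_and_lemma_counts(tokens, LEMMA_OF)

print(len(words), "word types;", len(lemmas), "lemma types")
print("count of lemma 'be':", lemmas["be"])
```

The total number of tokens is unchanged by the mapping; only the way they are grouped into types changes.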

pone.0129031.g004: The lower cut-off for the frequency distribution of lemmas (a_l) versus the lower cut-off for the frequency distribution of word forms (a_w). The line a_l = a_w is also shown (solid line).

Mentions: As we have done with the exponent γ, we define a_w and a_l as the lower cut-offs of the power-law fits for the frequency distributions of words and of lemmas, respectively. Those values are compared in Fig 4. When all texts are considered, a Student t-test for paired samples rejects the hypothesis that there is no significant difference between the values of a_w and a_l, even if the presence of one possible outlier is taken into account (t = −3.091 and p = 0.015). In fact, a_w and a_l are not independent, as their Pearson correlation is ρ = 0.961 (N = 10 and p = 0.0014 for the hypothesis ρ = 0, calculated through permutations of one of the variables). These results are not very surprising upon inspection of Fig 4: most points (a_w, a_l) lie above the diagonal, suggesting a tendency for a_l to exceed a_w.
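The two tests named in the passage above, a paired-sample t statistic and a permutation test for the Pearson correlation, can be sketched with the Python standard library alone. The ten cut-off pairs below are invented placeholder values, not the values plotted in Fig 4, and the function names are hypothetical. The two-sided criterion in the permutation test is also an assumption; the passage does not say which variant the authors used.

```python
import math
import random
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic computed on the differences x[i] - y[i]."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p(x, y, n_perm=5000, seed=0):
    """Estimate the p-value for rho = 0 by shuffling one of the variables."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:  # two-sided (assumption)
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Placeholder data: ten invented (a_w, a_l) pairs, one of them a large outlier.
a_w = [2, 3, 5, 8, 4, 6, 10, 3, 7, 40]
a_l = [3, 5, 6, 12, 5, 9, 15, 4, 10, 55]

t_stat = paired_t(a_w, a_l)
rho = pearson(a_w, a_l)
p_perm = permutation_p(a_w, a_l)
print(f"t = {t_stat:.3f}, rho = {rho:.3f}, permutation p = {p_perm:.4f}")
```

A negative t statistic here means that a_l exceeds a_w on average, which corresponds to points lying above the diagonal in a plot of a_l against a_w. Converting t into a p-value requires the Student t distribution with N − 1 degrees of freedom, which the standard library does not provide.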

