Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words.

Altmann EG, Pierrehumbert JB, Motter AE - PLoS ONE (2009)

Bottom Line: The extent of this deviation depends strongly on semantic type -- a measure of the logicality of each word -- and less strongly on frequency. We develop a generative model of this behavior that fully determines the dynamics of word usage. Because the use of words provides a uniquely precise and powerful lens on human thought and activity, our findings also have implications for other overt manifestations of collective human dynamics.


Affiliation: Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA.

ABSTRACT

Background: Zipf's discovery that word frequency distributions obey a power law established parallels between biological and physical processes, and language, laying the groundwork for a complex systems perspective on human communication. More recent research has also identified scaling regularities in the dynamics underlying the successive occurrences of events, suggesting the possibility of similar findings for language as well.

Methodology/principal findings: By considering frequent words in USENET discussion groups and in disparate databases where the language has different levels of formality, here we show that the distributions of distances between successive occurrences of the same word display bursty deviations from a Poisson process and are well characterized by a stretched exponential (Weibull) scaling. The extent of this deviation depends strongly on semantic type -- a measure of the logicality of each word -- and less strongly on frequency. We develop a generative model of this behavior that fully determines the dynamics of word usage.
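The quantity analyzed above, the distance between successive occurrences of the same word, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy sentence are hypothetical, and the coefficient of variation is a generic burstiness diagnostic, not necessarily the measure used in the article. For a memoryless (Poisson-like) process the coefficient of variation of the gaps is close to 1; values well above 1 indicate bursts separated by long lulls.

```python
import math
from collections import defaultdict


def recurrence_times(tokens):
    """Map each word to the distances between its successive occurrences.

    Distances are counted in tokens. Words that occur only once have no
    recurrence time and are left out of the result.
    """
    last_seen = {}
    gaps = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(position - last_seen[word])
        last_seen[word] = position
    return dict(gaps)


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


# Hypothetical toy text; real analyses need many occurrences per word.
tokens = "the cat saw the dog and the dog saw the cat".split()
gaps = recurrence_times(tokens)
print(gaps["the"])  # distances between successive occurrences of "the"

# A hand-made bursty gap sequence: four short gaps followed by one long lull.
print(coefficient_of_variation([1, 1, 1, 1, 16]))
```

In practice the token list would come from a whole corpus, and only words with enough occurrences would be kept, as in the frequency thresholds mentioned in the figure caption.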

Conclusions/significance: Recurrence patterns of words are well described by a stretched exponential distribution of recurrence times, an empirical scaling that cannot be anticipated from Zipf's law. Because the use of words provides a uniquely precise and powerful lens on human thought and activity, our findings also have implications for other overt manifestations of collective human dynamics.
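For reference, the standard stretched exponential (Weibull) form can be written as below. The symbols are the usual textbook notation and are an assumption here, since the extracted text does not preserve the article's own formulas: tau is the recurrence time, tau_0 a scale parameter, and beta the stretching exponent.

```latex
% Weibull (stretched exponential) density of recurrence times \tau
f(\tau) = \frac{\beta}{\tau_0}\left(\frac{\tau}{\tau_0}\right)^{\beta-1}
          \exp\!\left[-\left(\frac{\tau}{\tau_0}\right)^{\beta}\right],
\qquad \tau > 0,\ \tau_0 > 0,\ \beta > 0

% Corresponding survival function P(\text{recurrence time} > \tau)
S(\tau) = \exp\!\left[-\left(\frac{\tau}{\tau_0}\right)^{\beta}\right]

% Special case \beta = 1: the exponential law of a Poisson process
f(\tau)\big|_{\beta=1} = \frac{1}{\tau_0}\, e^{-\tau/\tau_0}
```

With beta = 1 the distribution reduces to the exponential law expected for a Poisson process. For 0 < beta < 1 both very short and very long recurrence times are more likely than under an exponential with the same mean, which corresponds to the bursts and lulls named in the title.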


Figure 3: Stretched exponential recurrence time distributions observed in different databases. The databases consist of the documentary novel Os Sertões by Euclides da Cunha (S), in Portuguese (); the USENET group comp.os.linux.misc (U) between Aug.  and Mar.  (); the three Obama-McCain debates of the 2008 United States presidential election (D) arranged in chronological order (); an English edition of the novel War and Peace by Leon Tolstoy (W) (); and the first English edition of Isaac Newton's Principia (P) (). All words appearing more than  times were considered in S ( words), D ( words), P ( words), and W ( words), whereas in U all  words appearing more than  times were used (see Text S1, Databases). (a) Recurrence time distributions for the words quase in S (), simple in U (), would in D (), voices in W (), and diameter in P (). (b) Histograms of the fitted stretched exponential exponent for all datasets. Due to sample size limits, the analysis into semantic Classes is not feasible for the smaller datasets. (c) Box-plots of the coefficient of determination of the corresponding stretched exponential fit.

Mentions: In Fig. 3 we verify our main results using databases of different sizes and characterized by different levels of formality. We analyzed a second example of a USENET group (U), a series of political debates (D), two novels (S,W), and a technical book (P) (for word-specific results see Table S1). The stretched exponential provides a close fit for frequent words in these datasets [Fig. 3(a,c)], and a wide and smoothly varying range of fitted exponents is observed in each case [Fig. 3(b)]. The technical book exhibits lower values, which can be attributed to the predominance of specific scientific terms. These datasets include examples of texts differing by almost four orders of magnitude in size, generated by a single author (books), a few authors (debates) or a large number of authors (USENET), in writing and speech (e.g., books vs. debates), and in different languages (e.g., novels), indicating that the stretched exponential scaling is robust with regard to sample size, number of authors, language mode, and language.
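One simple way to obtain a stretched exponential fit and its coefficient of determination is sketched below. It is an assumption for illustration, not necessarily the fitting procedure used in the article: the survival function S(tau) = exp(-(tau/tau0)**beta) is linearized as ln(-ln S) = beta*ln(tau) - beta*ln(tau0) and fitted by ordinary least squares. The function name, the plotting position (rank + 0.5)/n, and the synthetic data are all hypothetical.

```python
import math


def fit_stretched_exponential(gaps):
    """Least-squares fit of S(tau) = exp(-(tau / tau0) ** beta).

    Uses the linearized form ln(-ln S) = beta * ln(tau) - beta * ln(tau0)
    on the empirical survival function. Needs at least two distinct
    positive gaps. Returns (beta, tau0, r_squared).
    """
    ordered = sorted(gaps)
    n = len(ordered)
    xs, ys = [], []
    for rank, tau in enumerate(ordered):
        # Empirical survival probability, kept strictly between 0 and 1.
        survival = 1.0 - (rank + 0.5) / n
        xs.append(math.log(tau))
        ys.append(math.log(-math.log(survival)))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the linearized fit
    intercept = mean_y - beta * mean_x    # equals -beta * ln(tau0)
    tau0 = math.exp(-intercept / beta)
    r_squared = sxy ** 2 / (sxx * syy)    # coefficient of determination
    return beta, tau0, r_squared


# Synthetic check: exact quantiles of a Weibull with beta = 0.5, tau0 = 100.
n = 200
true_beta, true_tau0 = 0.5, 100.0
synthetic = [
    true_tau0 * (-math.log(1.0 - (i + 0.5) / n)) ** (1.0 / true_beta)
    for i in range(n)
]
beta, tau0, r_squared = fit_stretched_exponential(synthetic)
print(beta, tau0, r_squared)
```

Because the synthetic gaps are exact Weibull quantiles, the fit should recover the chosen parameters up to rounding error. On real word data the gaps are integers with many ties, so the fit is only approximate and maximum likelihood estimation may be preferable.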


