=Paper= {{Paper |id=Vol-1197/paper9 |storemode=property |title=Zipf's Law for LiveJournal |pdfUrl=https://ceur-ws.org/Vol-1197/paper9.pdf |volume=Vol-1197 |dblpUrl=https://dblp.org/rec/conf/aist/Trifonov14 }} ==Zipf's Law for LiveJournal== https://ceur-ws.org/Vol-1197/paper9.pdf
                    Zipf’s Law for LiveJournal

                                  Nikita N. Trifonov

                       Kazan Federal University, Kazan, Russia
                              nikita-trif@yandex.ru



       Abstract. The paper provides an overview of research of frequency of
       language units on the material of the LiveJournal corpus. The corpus
       includes texts on Russian language from 2002 to 2014 year, totaling more
       than 5 million words of articles written by 2 thousand authors. Research
       was held in the following main directions, represented in the present
       work: estimation of coefficients for the Zipf’s law for different authors,
       estimation of coefficients for the Zipf’s law for the total number of words
       in all the analyzed articles.

       Keywords: Zipf’s law, LiveJournal corpus, frequency of words, rank
       distribution.


1     Introduction
Perhaps the most famous statistical distribution in linguistics is Zipf’s law: in any
large enough text, the frequency ranks (starting from the highest) of wordforms
or lemmas are inversely proportional to the corresponding frequencies [1]:

                                     f (r) ∗ r = c,                                  (1)

where f (r) is the frequency of the unit (wordform or lemma) having the rank r
and c is a constant. With Mandelbrots improvements to Zipf’s law, the formula
(1) has next form [2]:
                                            c
                                    f (r) = γ ,                                 (2)
                                           r
where γ is the exponent coefficient (near to 1). Zipf’s law is most easily observed
by plotting the data on a log − log graph, with the axes being log(rank order)
and log(f requency). After taking the logarithm of the formula (2) :

                               ln(f (r)) = C − γln(r),                               (3)

   The LiveJournal source chosen to collect the corpus because it makes it
possible to explore articles as a whole and separately for each author.
   The LiveJournal1 (LJ) is a social network owned by SUP Media where In-
ternet users can keep a blog, journal or diary, and is also the name of the free
and open source server software which runs the LiveJournal website and online
community.
1
    http://www.livejournal.com/




                                           54
    In order to collect corpus of LiveJournal, created a program that gets the
text of articles written by one author, saves the text in a database and goes over
to another author for further information gathering.


2    Experimental Results
The graph plotted (Fig. 1) using for Zipf’s law the points: xr = log r, yr =
log f (r) where r = 1 . . . n, and n is the number of different units (wordforms or
lemmas). The Ordinary Least Squares used to approximate such a graph by a
straight line y = ax + b, where a and b correspond to γ and C for Zipf’s law (the
formula (3)) .




    Fig. 1. The Zipf’s law for the total number of words in all the analyzed articles.


    The graph (Fig. 1) can be divided into three different parts. The first part is a
nuclear zone consisting of the most frequently used words in the Russian language
- prepositions, pronouns, etc. The central part of the graph very important for
exploring. It is most accurately described by Zipf’s law. The last part, called
"zone of truncation " consists of words which do not carry meaning, rarely used
terms and grammatical errors.
    As the graph shows, the zone of truncation affects the result of the approx-
imation, and γ coefficient in approximating line is differs from the expected.
However, if we do not consider the zone of truncation, approximation line al-
most merges with the graph of frequency distribution, and γ coefficient satisfies
the improvements of Mandelbrot for Zipf’s law.
    Ten authors, who written the highest number of letters in articles, were se-
lected for further researches.




                                           55
    Table 1. List of 10 authors with the highest total number of words in articles

Author page on LiveJournal              The total number of words used by the author
http://eto fake.livejournal.com/                          584061
http://mzadornov.livejournal.com/                         583071
http://cuamckuykot.livejournal.com/                       347962
http://aillarionov.livejournal.com/                       302165
http://mgsupgs.livejournal.com/                           260479
http://matveychev oleg.livejournal.com/                   243579
http://steissd.livejournal.com/                           240800
http://kak eto sdelano.livejournal.com/                   234767
http://annatubten.livejournal.com/                        225701
http://adamashek.livejournal.com/                         214743



    For each author from the list given in Table 1 made separate research. These
researches have shown that all of graphs correspond to the Zipf’s law. An example
of this graph you can see in Figure 2.


                                 Rank distribution by Zipf’s law
                      11
                                                                   rank distribution
                                                                   γ : −0.90
                      10


                      9


                      8
           ln(f(r))




                      7


                      6


                      5


                      4
                       0   1       2            3           4          5               6
                                              ln(r)

     Fig. 2. Rank distribution by Zipf’s law for http://eto fake.livejournal.com/



   For comparison, made the research of rank distribution of word frequencies of
Zipfs law based on 4 volumes of books Leo Tolstoy’s "War and Peace." (Fig. 3)
   There are differences between the list of the most frequently encountered
words of Leo Tolstoy’s works and the list of the most frequently encountered
words of contemporary authors represented on LiveJournal. These differences are




                                           56
                                         Rank distribution by Zipf’s law
                      12
                                                                           rank distribution
                                                                           γ : −1.06
                      10



                      8
           ln(f(r))




                      6



                      4



                      2



                      0
                       0         2         4            6           8          10          12
                                                      ln(r)

                       Fig. 3. Rank distribution by Zipf’s law for L. N. Tolstoi


related with the difference between the vocabulary of Leo Tolstoy and modern
vocabulary, as well as skills in writing texts.

3   Conclusion
Exponential coefficients of Zipf’s law depend on text volume, genre of the text
and author’s style. The zone of truncation encountered in the research of large
texts or texts written by different authors. Explanation of this phenomenon
needs more investigation.


Acknowledgements. I would like to thank Valery Dmitrievich Solovyev, Ed-
uard Yulyevich Lerner, and Vladimir Vladimirovich Bochkarev for helpful dis-
cussions.

References
1. Zipf, G. K. Human behavior and the principle of least effort. Cambridge, MA,
   Addison-Wesley, 1949, p. 36.
2. Mandelbrot, B. An informational theory of the statistical structure of languages,
   Communication Theory, ed. W. Jackson, Betterworth, 1953, pp. 486502
3. Gelbukh, A., Sidorov, G. Zipf and Heaps Laws’ Coefficients Depend on Language.
   Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational
   Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science
   N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.




                                                    57
        Закон Ципфа для LiveJournal

                      Никита Н. Трифонов

Казанский (Приволжский) федеральный университет, Казань, Россия
                    nikita-trif@yandex.ru



Аннотация В статье представлен обзор исследований частоты язы-
ковых единиц на материале корпуса LiveJournal. Корпус включает
тексты на русском языке, написанные в период с 2002 по 2014 год.
Были исследованы статьи 2000 авторов, а так же более 5 000 000 сло-
воформ из этих статей. Исследование было проведено в следующих
основных направлениях, представленных в настоящей работе: расчет
коэффициентов закона Ципфа по отдельным авторам, расчет коэф-
фициентов закона Ципфа по всем проанализированным статьям без
дифференциации по авторам.

Ключевые слова: закон Ципфа, корпус LiveJournal, частота встре-
чаемости слов, ранговое распределение.




                                58