<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Zipf's Law for LiveJournal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikita N. Trifonov</string-name>
          <email>nikita-trif@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>54</fpage>
      <lpage>58</lpage>
      <abstract>
        <p>The paper provides an overview of research on the frequency of language units based on the LiveJournal corpus. The corpus includes texts in Russian written from 2002 to 2014, totaling more than 5 million words from articles written by 2 thousand authors. The research followed two main directions, both presented in this work: estimating the coefficients of Zipf's law for individual authors, and estimating the coefficients of Zipf's law for the total set of words in all the analyzed articles.</p>
      </abstract>
      <kwd-group>
        <kwd>Zipf's law</kwd>
        <kwd>LiveJournal corpus</kwd>
        <kwd>frequency of words</kwd>
        <kwd>rank distribution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Perhaps the most famous statistical distribution in linguistics is Zipf’s law: in any
large enough text, the frequency ranks (starting from the highest) of wordforms
or lemmas are inversely proportional to the corresponding frequencies [1]:
        <disp-formula id="eq1"><label>(1)</label><tex-math>f(r) = \frac{c}{r},</tex-math></disp-formula>
where f(r) is the frequency of the unit (wordform or lemma) having rank r
and c is a constant. With Mandelbrot’s improvements to Zipf’s law, formula
(<xref ref-type="disp-formula" rid="eq1">1</xref>) takes the following form [2]:
        <disp-formula id="eq2"><label>(2)</label><tex-math>f(r) = \frac{c}{r^{\gamma}},</tex-math></disp-formula>
where γ is the exponent (close to 1). Zipf’s law is most easily observed
by plotting the data on a log-log graph, with the axes being log(rank order)
and log(frequency). Taking the logarithm of formula
(<xref ref-type="disp-formula" rid="eq2">2</xref>) gives
        <disp-formula id="eq3"><label>(3)</label><tex-math>\ln f(r) = C - \gamma \ln r,</tex-math></disp-formula>
where C = ln c.
      </p>
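      <p>As a minimal sketch of the quantities r and f(r) used above, the following Python snippet counts wordform frequencies in a text and lists them by rank. The function name rank_frequencies, the regular-expression tokenizer, and the sample sentence are illustrative assumptions, not the tooling used in this work.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter


def rank_frequencies(text):
    """Return [(rank, wordform, frequency)], rank 1 = most frequent wordform."""
    # Assumed tokenizer: lowercase the text and take runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # most_common() sorts by descending frequency; ties keep first-seen order.
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for rank, word, freq in rank_frequencies(sample):
        print(rank, word, freq)
```
]]></preformat>
      <p>Lemma-based counts would need a lemmatizer in place of the tokenizer; that step is left out of the sketch.</p>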
      <p>To collect the LiveJournal corpus, a program was created that retrieves the
text of the articles written by one author, saves the text in a database, and then
moves on to the next author to continue gathering data.</p>
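      <p>The collection loop described above might look roughly like the following sketch. It is not the program used in this work: the function collect_corpus, the table layout, and the fetch_articles callable (which stands in for whatever code downloads one author's articles) are all hypothetical, and the example uses stand-in data instead of network requests.</p>
      <preformat><![CDATA[
```python
import sqlite3


def collect_corpus(authors, fetch_articles, db_path=":memory:"):
    """Store every article of every listed author in an SQLite table.

    fetch_articles is a hypothetical callable: author name -> list of texts.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (author TEXT, body TEXT)"
    )
    for author in authors:
        # Gather all texts of one author, save them, then move to the next.
        rows = [(author, body) for body in fetch_articles(author)]
        conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)
        conn.commit()
    return conn


if __name__ == "__main__":
    # Stand-in data instead of real network requests.
    fake_posts = {"author_a": ["first post", "second post"], "author_b": ["hello"]}
    conn = collect_corpus(fake_posts, lambda name: fake_posts[name])
    print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])
```
]]></preformat>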
    </sec>
    <sec id="sec-2">
      <title>Experimental Results</title>
      <p>
        The graph (Fig. 1) plots the points x_r = log r, y_r = log f(r) for Zipf’s law,
where r = 1, ..., n and n is the number of distinct units (wordforms or
lemmas). Ordinary least squares is used to approximate this graph by a
straight line y = ax + b, where the slope a corresponds to −γ and the intercept b
to C in formula (3).
      </p>
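      <p>The fitting step can be sketched as follows, assuming the closed-form ordinary least squares solution for a single regressor and natural logarithms as in formula (3). The function name fit_zipf and the synthetic frequencies are assumptions made for illustration; the function returns γ as the negated slope, following the sign convention of formula (3).</p>
      <preformat><![CDATA[
```python
import math


def fit_zipf(frequencies):
    """Fit ln f(r) = C - gamma * ln r by ordinary least squares.

    frequencies: unit frequencies sorted in descending order (rank 1 first).
    Returns (gamma, C).
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form OLS slope a and intercept b of the line y = a*x + b.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return -a, b


if __name__ == "__main__":
    # Synthetic frequencies following f(r) = 1000 / r exactly.
    freqs = [1000 / r for r in range(1, 201)]
    gamma, C = fit_zipf(freqs)
    print(gamma, C)
```
]]></preformat>
      <p>On the synthetic input the fitted γ should be close to 1 and C close to ln 1000, up to floating-point rounding.</p>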
      <p>The graph (Fig. 1) can be divided into three parts. The first part is a
nuclear zone consisting of the most frequently used words in the Russian language:
prepositions, pronouns, etc. The central part of the graph is the most important
for the analysis, since it is the part most accurately described by Zipf’s law. The last part, called
the "zone of truncation", consists of words that carry no meaning, rarely used
terms, and grammatical errors.</p>
      <p>As the graph shows, the zone of truncation affects the result of the
approximation, and the γ coefficient of the approximating line differs from the expected value.
However, if the zone of truncation is excluded, the approximating line
almost merges with the graph of the frequency distribution, and the γ coefficient agrees with
Mandelbrot’s improvement to Zipf’s law.</p>
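      <p>The ten authors whose articles contain the largest total number of letters were
selected for further study (Table 1).</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Author page on LiveJournal</th></tr>
          </thead>
          <tbody>
            <tr><td>http://eto_fake.livejournal.com/</td></tr>
            <tr><td>http://mzadornov.livejournal.com/</td></tr>
            <tr><td>http://cuamckuykot.livejournal.com/</td></tr>
            <tr><td>http://aillarionov.livejournal.com/</td></tr>
            <tr><td>http://mgsupgs.livejournal.com/</td></tr>
            <tr><td>http://matveychev_oleg.livejournal.com/</td></tr>
            <tr><td>http://steissd.livejournal.com/</td></tr>
            <tr><td>http://kak_eto_sdelano.livejournal.com/</td></tr>
            <tr><td>http://annatubten.livejournal.com/</td></tr>
            <tr><td>http://adamashek.livejournal.com/</td></tr>
          </tbody>
        </table>
      </table-wrap>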
      <p>A separate study was carried out for each author listed in Table 1. These
studies showed that all of the graphs conform to Zipf’s law. An example
of such a graph is shown in Figure 2.</p>
      <p>[Figure: rank distribution plot with axes ln(r) (horizontal) and ln(f(r)) (vertical).]</p>
      <p>For comparison, the rank distribution of word frequencies was also studied for
the four volumes of Leo Tolstoy’s "War and Peace" (Fig. 3).</p>
      <p>The list of the most frequently encountered words in Leo Tolstoy’s work differs
from the list of the most frequently encountered
words of the contemporary authors represented on LiveJournal. These differences are
related to the difference between Leo Tolstoy’s vocabulary and modern
vocabulary, as well as to differences in writing skill.</p>
      <p>[Figure: rank distribution; horizontal axis ln(r); legend: γ : −1.06.]</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The exponents of Zipf’s law depend on the text volume, the genre of the text,
and the author’s style. The zone of truncation is encountered when studying large
texts or texts written by different authors. Explaining this phenomenon
requires further investigation.</p>
      <p>Acknowledgements. I would like to thank Valery Dmitrievich Solovyev,
Eduard Yulyevich Lerner, and Vladimir Vladimirovich Bochkarev for helpful
discussions.</p>
      <p>Abstract. The article presents an overview of studies of the frequency of
language units based on the LiveJournal corpus. The corpus includes
texts in Russian written between 2002 and 2014.
Articles by 2000 authors were studied, together with more than 5 000 000
wordforms from these articles. The study was carried out in the following
main directions, presented in this work: calculating the
coefficients of Zipf’s law for individual authors, and calculating the
coefficients of Zipf’s law over all the analyzed articles without
differentiating by author.</p>
      <p>Keywords: Zipf’s law, LiveJournal corpus, word
frequency, rank distribution.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G. K.</given-names>
          </string-name>
          <article-title>Human behavior and the principle of least effort</article-title>
          . Cambridge, MA, Addison-Wesley,
          <year>1949</year>
          , p.
          <fpage>36</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mandelbrot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>An informational theory of the statistical structure of languages</article-title>
          , Communication Theory, ed. W. Jackson, Butterworth,
          <year>1953</year>
          , pp.
          <fpage>486</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Zipf and Heaps Laws' Coefficients Depend on Language</article-title>
          .
          <source>Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18-24</source>
          ,
          <year>2001</year>
          ,
          Mexico City
          .
          <source>Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0</source>
          , Springer-Verlag, pp.
          <fpage>332</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>