<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Probability Analysis of the Vocabulary Size Dynamics Using Google Books Ngram Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anastasia Pekina</string-name>
          <email>pekina.96@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yulia Maslennikova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Bochkarev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The article introduces a method for determining the rate of appearance of new words in a language. The method is based on probabilistic estimates of the vocabulary size of a large text corpus. Backward-predicted frequencies of rare words are estimated using linear models optimized by the maximum likelihood criterion. This approach provides more accurate frequency estimates for earlier periods; the lower the frequency of the word during the analyzed period, the greater the benefit. A posteriori estimates of word frequencies and of the probability of appearance of new words were used to refine the vocabulary size for different years and the rate of appearance of new words. According to the proposed probabilistic model, &gt; 30% of the investigated English and Russian words appeared in the language before they were first identified in the Google Books Ngram Corpus.</p>
      </abstract>
      <kwd-group>
        <kwd>Word usage frequencies</kwd>
        <kwd>Prediction</kwd>
        <kwd>Google Books Ngram</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Despite the long history of language study, we still do not know, even
approximately, how many words a specific language contains. Consider the
English language. At present, the most complete published English dictionary,
the Oxford English Dictionary [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], comprises more than 600,000 words. However,
it clearly does not contain all English words. For example, it omits
extremely rare words (those occurring less than once per billion words). The creation
of Google Books Ngram, which contains more than 500 billion English words,
gave researchers hope of obtaining a fairly complete list of words. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an
attempt was made to estimate the total number of words using this corpus.
Estimates were obtained at only three points. According to that study, the
language contained 554 thousand words in 1900, 597 thousand words in 1950
and 1022 thousand words in 2000. The article [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] shows graphs of the number
of words obtained by linear extrapolation for the remaining years of the 20th
century, without taking into account rare words. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors evaluate the
lexicon dynamics of the English language over the past 200 years, based on the
ratio of the number of unique words that appeared in the core language to the
number of all words for each decade, using the Corpus of Historical American
English and Google Books Ngram. The authors did not consider rare words that
occur in the corpus fewer than 300 times over 10 years. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], much attention was
paid to the analysis of the dynamics of the language core. The authors showed
that since 1805 the actual English core has not changed significantly (words in
the language core are updated at a rate of about 30 words per year), and
that the rate at which words outside the language core appear decreases
with time. Thus, few works are devoted to the analysis of the active vocabulary
in the early years (1800 and earlier), and even fewer take into account rare
words that are only just entering circulation. It should be noted that such words may
exist in the language yet fail to appear in the corpus for a given year because of the
limited volume of texts for that year.
      </p>
      <p>In this paper, we propose a probabilistic model for clarifying the dynamics
of word formation in English and Russian using the Google Books Ngram data.
This probabilistic approach allows us to take into account the limited amount of
text available for earlier periods. The suggested method is based on backward-predicted
estimates of the usage frequencies of rare word forms in the past, which are then
used to calculate more plausible estimates of the probability that a word form
was present in the lexicon, taking into account the size of the corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>Database</title>
      <p>This research is based on an analysis of word usage dynamics in the Google
Books Ngram database. The frequencies of all unique 1-grams in the database
were calculated, and the English and Russian corpora were then analyzed. The English
corpus contained 5.3 million unique words; the Russian corpus contained 4.9
million unique words. We analyzed the dynamics of the words that first appeared in the
corpora in 1800 (approximately 23,000 English words and 20,000 Russian words).</p>
      <p>
        The Google Books Ngram dataset has been criticized for its reliance on
inaccurate optical character recognition, an overabundance of scientific literature,
and the inclusion of large numbers of incorrectly dated and categorized texts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Because optical character recognition is not always reliable,
some characters may be scanned incorrectly. In particular, systematic errors such as
the confusion of "s" and "f" can introduce systematic bias. Nevertheless, the results
of Google Ngram Viewer are claimed to be reliable from 1800 onwards [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The following are a few examples of the most frequent incorrect 1-grams in the early
English frequency dictionary: "a'dvance", "traditional1", "draw ", "knowtheir",
"ossophagus", etc. Such 1-grams contain digits or non-Latin characters,
missing spaces, or substituted letters. For this reason, the investigated database
was pre-processed. To check the early English 1-grams,
the online Multitran dictionary was used, an Internet system containing
electronic dictionaries for more than 14 languages and over 5 million terms across all
language parts of the dictionary [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Each of the 23,000 selected English 1-grams
was checked in the Multitran dictionary. For the early Russian 1-grams, the
dictionary "Open Corpora" was used [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This is a crowdsourcing project to
create a morphologically, syntactically and semantically annotated corpus of Russian
texts that is fully accessible to researchers. The corpus contains about 5.1 million
words. After this check, only 2161 English 1-grams and 8452
Russian 1-grams were selected for further analysis. A random check of the
rare words removed from the investigated database showed that a large
number of rare words that appeared in the corpora in 1800 had indeed been incorrectly
recognized or misspelled in the original Google Books Ngram database.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods and results</title>
      <p>The proposed probabilistic approach is based on the idea of using predicted
estimates of 1-gram usage frequencies to refine them for early years.
Frequency estimates for later years are typically more reliable because of the larger
volume of the database. The left side of Fig. 1 shows the number of
books written in different years and included in the Google Books corpus
(English corpus, version 2012.07.01). Therefore, based on the later data,
it is possible to predict the expected frequencies for earlier periods. This method
is called "backward prediction".</p>
      <p>There are many ways to estimate the prediction coefficients of an
autoregression model; the appropriate choice depends on the distribution of fluctuations in
the investigated time series. In our case, we are estimating the
usage frequencies of sufficiently rare words, and we can therefore expect
the fluctuations to follow the Poisson law, which is given by
the probability function:</p>
      <p>
        <disp-formula><label>(1)</label><tex-math>P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}</tex-math></disp-formula>
      </p>
      <p>
        The variance and the mathematical expectation of a word usage frequency
distributed according to Poisson's law are both equal to the distribution parameter λ,
which depends on time; we therefore write λ<sub>t</sub>. The most accurate
estimates of the parameter λ<sub>t</sub> can be obtained using the maximum likelihood
method (MLE) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in which the following log-likelihood function is maximized:
        <disp-formula><label>(2)</label><tex-math>\log L(\vec{\lambda}) = \log w(\vec{X} \mid \vec{\lambda}) = \sum_{t} X_t \log \lambda_t - \sum_{t} \lambda_t - \sum_{t} \log X_t!</tex-math></disp-formula>
      </p>
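      <p>As an illustration only, the maximization of the Poisson log-likelihood can be sketched in Python for a first-order exponential model. The parametrization λ<sub>t</sub> = exp(a + b·t), the function names, the example counts and the damped Newton iteration are all assumptions made for this sketch; the text does not state which optimizer was actually used.</p>
      <preformat><![CDATA[
```python
import math


def poisson_loglik(counts, times, a, b):
    """Poisson log-likelihood for lambda_t = exp(a + b * t)."""
    return sum(
        x * (a + b * t) - math.exp(a + b * t) - math.lgamma(x + 1)
        for x, t in zip(counts, times)
    )


def fit_exponential_poisson(counts, times, max_iter=100, tol=1e-12):
    """Maximise the log-likelihood over (a, b) with damped Newton steps."""
    if sum(counts) <= 0:
        raise ValueError("at least one non-zero count is required")
    a, b = math.log(sum(counts) / len(counts)), 0.0  # start from a flat rate
    loglik = poisson_loglik(counts, times, a, b)
    for _ in range(max_iter):
        lam = [math.exp(a + b * t) for t in times]
        # Gradient of the log-likelihood with respect to a and b.
        grad_a = sum(x - l for x, l in zip(counts, lam))
        grad_b = sum(t * (x - l) for x, l, t in zip(counts, lam, times))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h_aa = sum(lam)
        h_ab = sum(t * l for t, l in zip(times, lam))
        h_bb = sum(t * t * l for t, l in zip(times, lam))
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 0.0:
            break  # degenerate design, e.g. a single time point
        step_a = (h_bb * grad_a - h_ab * grad_b) / det
        step_b = (h_aa * grad_b - h_ab * grad_a) / det
        # Halve the step until the log-likelihood does not decrease.
        scale = 1.0
        new_loglik = poisson_loglik(counts, times, a + step_a, b + step_b)
        while new_loglik < loglik and scale > 1e-8:
            scale /= 2.0
            new_loglik = poisson_loglik(
                counts, times, a + scale * step_a, b + scale * step_b
            )
        a, b = a + scale * step_a, b + scale * step_b
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
    return a, b


# Hypothetical yearly counts of a rare word; years are centred for stability.
years = list(range(1800, 1810))
times = [y - 1804.5 for y in years]
counts = [0, 1, 0, 2, 1, 3, 2, 4, 5, 7]
a_hat, b_hat = fit_exponential_poisson(counts, times)
fitted = [math.exp(a_hat + b_hat * t) for t in times]
```
]]></preformat>
      <p>In this sketch, evaluating exp(a + b·t) at years before the first observation gives the backward-predicted rates. Centring the time axis is a numerical convenience of the sketch and not part of the method described here.</p>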
      <p>
        A similar approach can be applied to nonlinear prediction using artificial
neural networks trained by maximum likelihood. This approach is described
in more detail in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        The use of the maximum likelihood method significantly improves the accuracy
of the estimated frequencies for rare words compared with the ordinary least squares
procedure (MSE) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A limiting form of the Poisson distribution is the Gaussian
distribution. If the value of the parameter λ is large, the MLE
estimates will be close to those of weighted MSE. Thus, for modelling
frequently used words, the MLE approach does not provide significant
advantages. To compare the effectiveness of the MLE and MSE methods, statistical
modelling was carried out. We generated artificial series of random
numbers distributed according to Poisson's law, whose parameter λ changed
with time according to the law λ(t) ∼ exp(βt). The parameters of λ(t)
were chosen to correspond to real word frequencies from the database. The
parameters were then estimated using the two methods. As expected, the results
show that the lower the frequency of word usage, the greater the
gain from using the MLE method. For example, at an average frequency of
0.5 usages per year, the standard deviation of the estimate of the parameter λ is
halved compared with the usual estimate based on the average value of
the empirical frequency.
      </p>
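      <p>The kind of artificial series used in such a comparison can be generated as in the following minimal sketch. It assumes a rate of the form lam0·exp(beta·t); the names lam0 and beta, the seed and the example values are chosen for illustration and are not taken from the study. Knuth's multiplication method is used for sampling because it is simple and adequate for the small rates typical of rare words.</p>
      <preformat><![CDATA[
```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (suited to small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_counts(lam0, beta, n_years, seed=0):
    """Simulate yearly counts with rate lam0 * exp(beta * t), t = 0..n_years-1."""
    rng = random.Random(seed)
    return [sample_poisson(lam0 * math.exp(beta * t), rng) for t in range(n_years)]


# Hypothetical setting: a rare word whose expected count starts at 0.5 per year.
series = simulate_counts(lam0=0.5, beta=0.01, n_years=210, seed=42)
```
]]></preformat>
      <p>Repeating the simulation with different seeds and fitting each series with both estimators would give the spread of the estimates that the comparison refers to.</p>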
      <p>The probability density functions of the calculated parameters of the auto-regressive
models for English 1-grams are shown on the right side of Fig. 1. It can be
seen that many 1-grams from the pre-processed database (density curve "AFTER")
are more widely used in later years: the center of the probability density
function corresponds to a positive value, which is expected for a developing
language. The maximum of the density curve "BEFORE" (before database
pre-processing) lies in the negative range of values, since typos and words
with recognition errors do not tend to increase in usage
frequency (they are present in the database with approximately constant, small
frequencies).</p>
      <p>Fig. 1. Left: number of books by year included in the Google Books corpus (horizontal axis: years). Right: probability density of the auto-regressive model parameters for English 1-grams (vertical axis: density; curves "AFTER" and "BEFORE" pre-processing, with a "ZERO LEVEL" reference line).</p>
      <p>Fig. 2 (left) shows the predicted usage frequencies of two rare English
words, 'shiftlessness' and 'tunnelling', obtained with a simple first-order linear regression
model with maximum likelihood optimization. The order of the model was
chosen according to the length of the investigated time series (210 points for each
word): the data are very noisy, so higher-order models cannot
be effective. This model leads to an exponential dependence of frequency on time,
exp(βt), where β is the model parameter. The dependence is plotted on a logarithmic
scale along the ordinate axis.
Fig. 2 (right) shows the same plot for two rare Russian words. Note
that these Russian and English words were rare in the early years. In both
cases, the backward-predicted frequencies correspond to a positive value of the parameter β
of the exponential model; in other words, the usage frequencies increased after
1800.</p>
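      <p>Fig. 2. Backward-predicted usage frequencies (logarithmic ordinate) of two rare English words (left) and two rare Russian words (right).</p>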
      <p>After backward prediction of the usage frequencies, the prediction
errors were found to be approximately lognormally distributed for both databases (English
and Russian). Knowledge of the backward-predicted values and of the error distribution
law makes it possible to estimate the actual usage frequencies by the
maximum a posteriori probability criterion. This criterion has a significant advantage
over estimates based on the mean value of the empirical frequencies.</p>
      <p>The updated usage frequencies for early years can be used
to refine the actual volume of the lexicon and the rate of word
formation. Suppose, for example, that a word is registered in the corpus for the first
time. One possible reason is that the word was already in use in the living
language and entered the corpus only because the corpus volume increased. The other
possibility is that it really is a new word. By extrapolating usage frequencies
to previous years, we can calculate the probability that a word with such a
frequency would not be detected in a corpus of known volume. Performing this
calculation for each word allows us to refine the actual lexicon dynamics (and,
correspondingly, the rate of appearance of new words) for different years.</p>
      <p>Let <inline-formula><tex-math>\hat{f}_t</tex-math></inline-formula> be the backward-predicted relative usage frequency of a word in
year t, and let N<sub>t</sub> be the volume of the corpus for that year. The probability
that the word does not occur in that year is then
<inline-formula><tex-math>(1 - \hat{f}_t)^{N_t}</tex-math></inline-formula>. Since we
consider word usage in different years to be independent,
the total probability P<sub>0</sub> that the word did not appear in the corpus
before 1800 is the product of the probabilities for the individual years:
        <disp-formula><label>(3)</label><tex-math>P_0 = \prod_{t} (1 - \hat{f}_t)^{N_t}</tex-math></disp-formula>
      </p>
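      <p>A minimal Python sketch of this product is given below. The function name and the example frequencies and volumes are hypothetical and serve only to show the arithmetic. Because corpus volumes are very large and frequencies very small, the sketch accumulates the product in log space, which is an implementation choice rather than something specified in the text.</p>
      <preformat><![CDATA[
```python
import math


def prob_never_observed(freqs, volumes):
    """Return P0 = prod_t (1 - f_t) ** N_t, accumulated in log space."""
    log_p0 = 0.0
    for f, n in zip(freqs, volumes):
        if not 0.0 <= f < 1.0:
            raise ValueError("relative frequencies must lie in [0, 1)")
        log_p0 += n * math.log1p(-f)  # log((1 - f) ** n), accurate for tiny f
    return math.exp(log_p0)


# Hypothetical backward-predicted frequencies and corpus volumes for three years.
predicted_freqs = [1e-9, 2e-9, 4e-9]
corpus_volumes = [5e7, 6e7, 8e7]
p0 = prob_never_observed(predicted_freqs, corpus_volumes)
```
]]></preformat>
      <p>For very small frequencies each factor is close to exp(−N<sub>t</sub>·f<sub>t</sub>), so P<sub>0</sub> is governed by the expected number of occurrences summed over the years considered.</p>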
      <p>
        The approach is presented in more detail in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The null hypothesis is that
the word had appeared in the language before it was first
identified in the corpus. For example, the probability that the word 'shiftlessness'
was absent from the database because of the small volume of the corpus before
1800 is 0.75, and the corresponding probability for the word 'tunnelling' is 0.56. Both
probabilities are quite high.
      </p>
      <p>For 737 English words (34%) from the investigated database, this probability
is &gt; 0.5; in other words, these words appeared before 1800 with
probability &gt; 0.5. Analysis of the predicted usage frequencies for these words
showed that many of them (&gt;74%) could have appeared before 1700, and
only 26% originated around 1800. In the Russian
database, 3,100 words (36%) have probability &gt; 0.5 (&gt;70% of them could have
originated before 1700).</p>
      <p>According to the proposed probabilistic model, the date
of the first appearance of a word in the corpus does not always coincide with
the date of its appearance in the language. For example, more than
30% of the words that first appeared in the corpus in 1800 were highly likely to have
come into use much earlier, but did not enter the corpus before 1800 because of the
insufficient size of the database.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. OED Online Homepage, http://www.oed.com/.
          <source>Last accessed 15 Apr 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , The Google Books Team,
          <string-name>
            <surname>Pickett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoiberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clancy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orwant</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Quantitative Analysis of Culture Using Millions of Digitized Books</article-title>
          .
          <source>Science</source>
          <volume>331</volume>
          (
          <issue>6014</issue>
          ),
          <fpage>176</fpage>
          –
          <lpage>182</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jatowt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanaka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Large scale analysis of changes in English vocabulary over recent time</article-title>
          .
          <source>ACM International Conference Proceeding Series</source>
          .
          <fpage>2523</fpage>
          –
          <lpage>2526</lpage>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gerlach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altmann</surname>
          </string-name>
          , EG.:
          <article-title>Stochastic Model for the Vocabulary Growth in Natural Languages</article-title>
          .
          <source>Phys. Rev. X</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <volume>021006</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pechenick</surname>
          </string-name>
          , E.:
          <article-title>Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution</article-title>
          .
          <source>PLoS ONE</source>
          <volume>10</volume>
          (
          <issue>10</issue>
          ),
          e0137041 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          Multitran Homepage, https://www.multitran.ru/.
          <source>Last accessed 15 Apr 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Open Corpora Homepage, http://opencorpora.org/.
          <source>Last accessed 15 Apr 2018</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <source>Digital Filters and Signal Processing. 2nd edn</source>
          . Kluwer Academic Publishers, Boston (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Maslennikova</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bochkarev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voloskov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Modelling of word usage frequency dynamics using artificial neural network</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          <volume>490</volume>
          (
          <issue>1</issue>
          ),
          <volume>012180</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bochkarev</surname>
          </string-name>
          , V.;
          <string-name>
            <surname>Lerner</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shevlyakova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deviations in the Zipf and Heaps laws in natural languages</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          <volume>490</volume>
          (
          <issue>1</issue>
          ),
          <volume>012009</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>