<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linguistic Big Data: Problem of Purity and Representativeness</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan (Volga Region) Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kazan National Research Technological University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>This paper deals with the quality problem of linguistic big data exemplified by the corpus of Google Books Ngram. The criticism of this corpus has been summarized and discussed. Special attention is paid to the matters of the corpus balance, spelling errors, and errors in metadata. It is also compared to the Russian National Corpus and to the General Internet-Corpus of Russian. A new concept, “diachronically balanced corpus”, has been introduced. The methods are discussed for enhancing the quality of Google Books Ngram.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Corpora</kwd>
        <kwd>Data Representativeness</kwd>
        <kwd>Time-series</kwd>
        <kwd>Data Noisiness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Modern linguistic studies can hardly do without large text corpora and linguistic
databases of various types, or without computer-based and mathematical methods applied
to them to obtain valid statistics. For the Russian language, the Russian National
Corpus [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (abbreviated as RNC) is well known; it is described in [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. RNC contains
over 350 million words [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is carefully checked, and part of it has been disambiguated
manually. All this makes it exceptionally useful for research on the Russian language.
      </p>
      <p>A new and interesting resource has appeared recently, the General Internet-Corpus of
Russian (http://www.webcorpora.ru/, abbreviated as GICR). It already contains over 20
billion words and is expected to expand further. GICR includes content from
the largest resources of the Runet, such as Zhurnalny Zal (Room of Thick Literary
Journals), Novosti (News), VKontakte, LiveJournal, and Blogs at Mail.ru.</p>
      <p>
        Since 2009, there has been an even larger corpus, Google Books Ngram
(https://books.google.com/ngrams, abbreviated as GBN). It contains data on 9
languages, including Russian. The Russian GBN sub-corpus contains over 67
billion words, while the English one exceeds 500 billion. GBN was created by
scanning all books from more than 40 of the world's largest libraries, including those of
Harvard University and Oxford University, and then applying text recognition. As a result, 30
million books were digitized, of which the 8 million with the best digitization quality were
selected to form the corpus; this amounts to 6% of all the books published worldwide
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A detailed description of the GBN can be found in [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4–6</xref>
        ]. For corpora this large,
questions about their quality and about errors introduced during their creation are
unavoidable. In this paper, we focus on GBN as the largest corpus currently
in existence. Some publications have noted errors in the corpus [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7–9</xref>
        ]. Three
core GBN problems are discussed in the literature: OCR errors, the balance of the corpus,
and errors in metadata. Here we consider these problems and possible
ways of improving the corpus. The paper presents a review of key publications on
GBN issues, as well as original results obtained by the authors in this area.
      </p>
    </sec>
    <sec id="sec-2">
      <title>OCR Errors</title>
      <p>
        There are recognition errors in the GBN corpus, primarily in old
books with poor print quality. In the first GBN version, released in 2009,
the situation was quite poor. In old English books, the letter s was frequently
recognized as f. For example, the word best was misrecognized as beft in up to 50% of the
17th-century books processed. Google, the creator of this resource, took the
criticism into account and considerably improved the recognition quality. Scanning devices were
upgraded every six months [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As a result, in the second version, released in 2012, best was
incorrectly recognized as beft in just 10% of cases in 17th-century books, while in
contemporary books from 2000 the error rate was only 0.02%, which is
too low to affect any statistics on the frequency of the word
best.
      </p>
      <p>The situation is similar for Russian. We have examined several
dozen randomly chosen words containing recognition errors. Typically, the error rate
does not exceed 0.1%. For example, the letter н is sometimes recognized as и. Fig. 1
shows sample frequency diagrams for the word “иней” (hoarfrost). The
frequency of its incorrect recognition as “иией” is lower than 0.1% of that of the correct form.</p>
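      <p>The error rate quoted here is the ratio of the frequency of the misrecognized form to that of the correct form. The following Python sketch illustrates the computation; the function name and the yearly counts are hypothetical and are not taken from GBN.</p>
      <preformat><![CDATA[
```python
def relative_error_rate(correct_counts, error_counts):
    """Ratio of total misrecognized occurrences to total correct occurrences.

    Both arguments map a year to an occurrence count.
    """
    total_correct = sum(correct_counts.values())
    total_error = sum(error_counts.values())
    if total_correct == 0:
        raise ValueError("the correct form never occurs")
    return total_error / total_correct


# Hypothetical yearly counts, for illustration only (not GBN data).
correct = {2000: 5000, 2001: 5200, 2002: 4800}  # occurrences of the correct form
garbled = {2000: 3, 2001: 4, 2002: 2}           # occurrences of the OCR variant

rate = relative_error_rate(correct, garbled)
print(f"{rate:.4%}")
```
]]></preformat>
      <p>Summing over several years before dividing, as in this sketch, is one possible choice; a per-year ratio would show whether the error rate changes over time.</p>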
      <p>For Russian, certain difficulties arise with the pre-reform (before 1918)
orthography. The Russian language previously used letters, such as Ѣ (yat) and Ѳ (fita), that are
incorrectly recognized in GBN. For data after 1918, the problem disappears.
For many words from texts published before 1918, the data will also be correct, since the letters
Ѣ and Ѳ, as well as other elements of the old orthography, are rather uncommon.</p>
      <p>Thus, there is no apparent reason to believe that recognition errors
substantially affect word-frequency counts, except perhaps in
rare cases involving old books, where some care must be exercised.</p>
    </sec>
    <sec id="sec-3">
      <title>Balance</title>
      <p>
        The problem of corpus balance is considerably more
complicated. A corpus should be considered balanced if all types of texts, i.e., literary,
journalistic, pedagogic, scientific, business, and other texts, occur in it in
proportion to their shares among the texts of the chosen period [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is commonly
assumed that the RNC is well balanced, owing to the efforts of its developers,
who “hand-picked” the texts for the corpus. GBN was created by a very different
technique, and its composition was not specially designed, so GBN is often criticized for being
unbalanced [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a radical opinion is expressed: “Therefore, instead of speaking about general
linguistic or cultural change, it seems to be preferable to explicitly restrict the results to
linguistic or cultural change ‘as it is represented in the Google Ngram data’”. In fact,
the author of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] takes the same position, suggesting that a well-balanced corpus
is a utopia and that any data obtained from a corpus reflect the content of that
corpus rather than the state of the language.
      </p>
      <p>If this were really the case, the creation of text corpora, on which much effort has
been spent, would be an unpromising activity. Corpora would then be just a
set of examples that linguists can extract to quote in their articles, and they would not be
suitable as a tool for fundamental research into the nature of a language. Fortunately,
this is not the case. The best evidence that large corpora adequately represent the language as a whole
is the reproducibility of results across different corpora. Let us
give a simple example.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], changes in the frequency of the members of a
synonym set that includes the words стараться (try) and пытаться (attempt) were considered. In
Figures 2 and 3, the graphs are shown for the most frequently used forms from the
inflectional paradigm, i.e., старался and пытался (both are past-tense masculine
singular forms), for GBN and RNC. The trend of the last two centuries is
clearly visible in both corpora: пытался becomes more frequent than старался.
The GBN graphs are smoother, since this corpus is larger. Even the
period in which пытался overtakes старался is the same: around the year
1960. For high-frequency words, the graphs of GBN and RNC are usually similar as
well.
Such agreement between results obtained on different corpora both validates the results
themselves and indicates the high quality and consistency of the corpora. Unfortunately,
we cannot directly check everything on GICR within the above timeframe, since almost
all GICR texts date from the 21st century, and the corpus has only recently begun to extend further back in time.
      </p>
      <p>However, there is an interesting possibility of performing an indirect comparison. Time
series generated from diachronic corpora provide a great opportunity to predict the
development of the language. To our knowledge, no quantitative predictions regarding language
change have so far been made on the basis of corpora. In this paper, we make what is probably one
of the first attempts of this kind. The scheme proposed for extrapolating time series can
be useful for research on various language-specific phenomena.</p>
      <p>
        In Table 1, the GBN-based frequencies of the words пытался and старался are
shown for 1978 and 2008, together with the ratio of the former to the
latter. Using linear regression, we compute the expected values of the
frequencies for the year 2014. That year was chosen in order to compare our predictions with
the data of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in which the time interval is limited to the years 2014–2015, the latest
available to the authors at the time of writing. The increase in the use of
пытался relative to старался over the 30-year period 1978–2008 leads us to
expect further growth by 2014.
      </p>
      <p>Table 1. GBN-based frequencies of the words пытался and старался and the ratio пытался/старался for 1978 and 2008.</p>
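      <p>The extrapolation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text does not specify which years enter the regression, so the sketch fits an ordinary least-squares line to a few evenly spaced points. The function names and all frequency values are hypothetical and are not the values of Table 1.</p>
      <preformat><![CDATA[
```python
def fit_line(years, values):
    """Ordinary least-squares fit of values = slope * year + intercept."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (year, value) pairs")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(years, values, target_year):
    """Predict the value at target_year from the fitted line."""
    slope, intercept = fit_line(years, values)
    return slope * target_year + intercept


# Hypothetical relative frequencies (per million words), NOT the Table 1 values.
years = [1978, 1988, 1998, 2008]
freq_a = [40.0, 44.0, 48.0, 52.0]  # stands in for the word whose use grows
freq_b = [30.0, 29.0, 28.0, 27.0]  # stands in for the word whose use declines

pred_a = extrapolate(years, freq_a, 2014)
pred_b = extrapolate(years, freq_b, 2014)
print(pred_a, pred_b, pred_a / pred_b)
```
]]></preformat>
      <p>In this sketch the predicted ratio is obtained by extrapolating each frequency separately and then dividing; fitting a line to the ratio itself is an alternative that may give a different prediction.</p>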
      <p>
        Let us consider the GICR-based data from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We divide all the GICR sub-corpora
into three groups that differ in their genres and styles: 1) Zhurnalny Zal that contains
texts from literary journals and is the closest one to GBN; 2) Novosti that contains the
texts of another genre, and 3) LiveJournal and VKontakte, both containing texts that
fundamentally differ from book texts. Hence, we can expect that data for Zhurnalny Zal
will be similar to that of GBN, while the data for Novosti and for the sub-corpora of
the third group will be different.
      </p>
      <p>
        Indeed, the ratio of the number of uses of пытался to that of старался
is 1.94 in Zhurnalny Zal, 10.41 in Novosti, and 3.30 in the sub-corpora of the third
group [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Thus, we can see that text genre and style are, of course, of great
importance. At the same time, for texts of a similar nature, such as books from GBN
and articles from literary journals, the values predicted from GBN and the actual
values turn out to be very close to each other, differing by less than 6%. This also
indicates the high quality of the corpora being compared and the possibility of obtaining
fairly accurate predictions based on GBN.
      </p>
      <p>
        Unfortunately, not all the studies performed on GBN can be repeated on RNC or
GICR. This is because, unlike GBN, the RNC and GICR corpora are not available to
users for downloading. This restricts the processing of RNC and GICR
data to simple queries and does not allow applying the complex computational and
mathematical data-processing methods that are widely used in contemporary research.
The latter include measuring the distances between languages at some point in
time, or between the states of a single language at different points in time, using
the Kullback-Leibler divergence [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
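      <p>As an illustration of such a measure, the following Python sketch computes the Kullback-Leibler divergence between two word-frequency distributions, for example two time slices of one language. It is not the procedure of the cited work: the add-constant smoothing, the function name, and the counts are assumptions made for this example only.</p>
      <preformat><![CDATA[
```python
import math


def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """Kullback-Leibler divergence D(P || Q) in bits between two word-count tables.

    Add-constant smoothing over the joint vocabulary (an assumption of this
    sketch) keeps every probability positive, so the logarithm is defined.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (p_counts.get(word, 0) + smoothing) / p_total
        q = (q_counts.get(word, 0) + smoothing) / q_total
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical counts for two time slices, for illustration only.
slice_1978 = {"a": 120, "b": 80, "c": 10}
slice_2008 = {"a": 90, "b": 100, "c": 30}

print(kl_divergence(slice_1978, slice_2008))
print(kl_divergence(slice_1978, slice_1978))
```
]]></preformat>
      <p>Note that this quantity is not symmetric in its two arguments, so it is not a metric in the strict mathematical sense; a symmetrized variant can be used when a true distance is required.</p>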
      <p>In our opinion, GBN is an example of a corpus that is as
well balanced as possible. Since all the books from the largest libraries were scanned when
creating it, all types of texts are represented in GBN in proportion to
their representation in the libraries. So GBN is as balanced as the entire
human-created library system is. This result cannot be achieved by manually
selecting texts.</p>
      <p>Let us discuss the term “balance”. So far, balance has meant the balance of
the corpus taken as a whole. In this paper, we introduce a new term: the diachronically
balanced corpus.</p>
      <p>We apply the concept of “diachronically balanced” to a diachronic corpus that
is balanced at any given moment of time, ideally for every year or decade. That is, a
corpus sample within any small timespan is itself balanced.
Until now, the problem of creating diachronically balanced corpora has not even been
stated. However, the giant volume of GBN, together with the adopted approach of total
scanning, makes this corpus exactly like that. For Russian, over the past decades, the
volume of the corpus has been about 1 billion words per year, which is about three times
the volume of the entire RNC. For English, the corpus volume is 10 times larger.
Naturally, we cannot prove the diachronic balance of GBN, since there is no operational
definition of balance that would allow us to classify corpora as balanced or
unbalanced.</p>
      <p>Let us consider a specific example demonstrating the degree of balance of GBN as
compared with RNC. In the USSR of the late 1980s, the word ускорение (acceleration),
borrowed from physics, entered the political vocabulary, where it meant
the accelerated development of the country’s economy. The term came to be widely
used in political writing after April 23, 1985, when M.S. Gorbachev announced at
the Plenum of the Central Committee of the Communist Party of the Soviet Union (CC CPSU) a
large-scale program of reforms under the slogan of accelerating the social and
economic development of the country. However, just 2 years later, in January 1987, the
Plenum of the CC CPSU set the task of radically restructuring
economic management. The new slogan of перестройка (reconstruction) appeared,
and ускорение began to lose its relevance. Let us look at how frequently the
word ускорение was used in GBN and RNC.</p>
      <p>
        In Fig. 4, we can see that the sharp rise in the frequency of the word
ускорение falls exactly in 1985, and that its frequency decreases sharply
starting from 1987. Thus, GBN adequately reflects the volume of socially and
politically focused literature at that time and accurately reflects the processes under way in society.
It is also noted in [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] that changes in languages registered in GBN correlate with social
events. What about RNC?
      </p>
      <p>In Fig. 5, no growth in the frequency of the word ускорение can be seen in
RNC for that period. Moreover, the frequency of ускорение starts falling in 1985 and
growing in 1987. At the same time, no political texts are found among the RNC texts
of those years that contain the word ускорение. This is, of course, just one
example. However, it shows how difficult it is to ensure diachronic balance when
assembling a corpus manually, and how naturally it arises from total
digitization.</p>
    </sec>
    <sec id="sec-4">
      <title>Errors in Metadata</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a metadata error was found in the English Fiction sub-corpus. In the first version,
many scientific books were included in it. This was discovered by comparing the use in the
Fiction corpus of a word typical of scientific texts, ‘Figure’, with that of the word
‘figure’ (lowercased), which may also occur in literary works. Fig. 6 clearly shows
the unnatural growth in the use of ‘Figure’, which coincides in time with the
exponential growth of scientific publications.
      </p>
      <p>This was taken into account in the second version of the corpus, and the books were
classified correctly. As a result, the frequency of the word ‘Figure’ in the Fiction
sub-corpus fell by a factor of 20 (Fig. 7).
Thus, in this case too, Google responded rapidly to criticism, and the error was
corrected.</p>
      <p>
        Further, the authors of [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] returned to using GBN in their studies of the language
evolution [
        <xref ref-type="bibr" rid="ref11 ref34">11, 34</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Using GBN</title>
      <p>Despite the problems mentioned above, the corpus is widely used in various
linguistic and culturological studies. Google Scholar lists over 6,500 articles mentioning GBN;
187 of them were published within the first 3.5
months of 2019. A review of those works is far beyond the scope of this paper.
However, we would like to note some of the research trends that are, in our opinion, the most interesting and typical,
and that demonstrate the considerable room for using GBN in Digital
Humanities.</p>
      <p>
        In linguistics, the topics studied include the number of words in a language's
vocabulary and its change over time [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the dynamics of the “births”
and “deaths” of words [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the rate of evolution of languages and their vocabularies
[
        <xref ref-type="bibr" rid="ref14 ref9">9, 14</xref>
        ], the competition between regular and irregular verb forms in
English [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the differences between British English and American English [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        In psychology, emotions [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18">15–18</xref>
        ] and cognitive processes [
        <xref ref-type="bibr" rid="ref19 ref20 ref21">19–21</xref>
        ] have been
studied. One of the most popular topics has been the change in the psychology of
collectivism/individualism. Of the many works in this area, we would like to note articles
[
        <xref ref-type="bibr" rid="ref22 ref23 ref24 ref25">22–25</xref>
        ], in which the growth of individualism was tracked in different countries, as
exemplified by English, German, Russian, and Chinese.
      </p>
      <p>
        In social studies, research has addressed gender differences and diversity
[
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ] and global cultural trends [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Directions in Enhancing the Results</title>
      <p>
        We can propose two ways of enhancing the reliability of the results obtained using
GBN. The first one consists in using and comparing all types of data that can be
extracted from GBN. This particularly includes considering, along with the word itself,
its various inflectional forms [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and synonyms [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], this is exemplified by
the German word eigen (own, peculiar), which is relatively rare in this form
but occurs 35 times more frequently in the form eigenen. In [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], it is
recommended to use each word together with three synonyms selected from the relevant dictionaries of
synonyms. If the research is intercultural in nature, it is natural to use the
corpora for several languages represented in GBN [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] in order to compare the dynamics
of the frequencies of the same or close terms. For research in English, GBN provides
several corpora, such as general English, American English, British English, and the
Fiction corpus. They can also be used to compare and verify the results obtained. For
example, in [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], the dynamics of the first-person pronoun frequencies can be tracked
using both the English corpus and the fiction corpus.
      </p>
      <p>
        The second way consists in preprocessing the raw data provided by GBN. Although this
approach is rather labour-intensive, it can still be recommended. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a corpus
preprocessing procedure is described that removes all tokens (character strings) that are
not words: every token containing digits or other non-alphabetic
characters, except for apostrophes, is deleted. (The ‘–’ symbol is processed by the GBN system itself.)
Such preprocessing is probably especially relevant for languages that have undergone spelling
reforms, such as Russian. The 1918 reform removed ъ (the hard sign) at the
ends of masculine words ending in a consonant. To process those words correctly, it
would be reasonable to delete ъ at the ends of all such words. This is a realistic approach
for a researcher, and it allows an enormous number of Russian
words, practically all masculine nouns, to be processed correctly.
      </p>
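      <p>The two preprocessing steps discussed here, filtering out non-word tokens and removing the word-final hard sign, can be sketched in Python as follows. The function names and the sample tokens are hypothetical, and the sketch is an illustration rather than the procedure used in the cited work.</p>
      <preformat><![CDATA[
```python
def is_word_token(token):
    """True if the token has only letters and apostrophes, and at least one letter."""
    has_letter = any(ch.isalpha() for ch in token)
    return has_letter and all(ch.isalpha() or ch == "'" for ch in token)


def strip_final_hard_sign(token):
    """Drop a word-final hard sign, as written in pre-1918 Russian spelling."""
    if len(token) > 1 and token[-1] in "ъЪ":
        return token[:-1]
    return token


def preprocess(tokens):
    """Keep word tokens only and normalize the word-final hard sign."""
    return [strip_final_hard_sign(t) for t in tokens if is_word_token(t)]


# Hypothetical sample tokens, for illustration only.
sample = ["домъ", "1918", "don't", "объект", "ab3", "міръ", "--"]
print(preprocess(sample))
```
]]></preformat>
      <p>Only the final ъ is removed, so a hard sign inside a word (as in объект) is preserved. After such normalization, the frequencies of the old and new spellings of a word would still have to be summed year by year.</p>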
      <p>Other systematic changes in spelling can be identified, and the corpus corrected
accordingly to comply with the current spelling rules. RNC has adopted the replacement of the old
orthography with modern spelling. This is reasonable, of course, only for
studies that do not focus on the old orthography itself.</p>
      <p>
        It is unrealistic, of course, to eliminate all the errors in a multibillion-word corpus.
Therefore, it would be reasonable to try applying recently developed methods for
working with noisy language data [
        <xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Creating very large specialized and multi-purpose text corpora is important for both
theoretical and applied research in linguistics and allied areas of knowledge. Very large text
corpora, especially diachronic ones, create fundamentally new opportunities for studies
that simply could not have been performed without them. The GBN corpus reflects very
accurately both language change and the processes occurring in society that are
reflected in the language. This allows the corpus to be used in various kinds of humanities research.
Diachronic corpora give a researcher the opportunity both to describe the
observed properties of a language and to make reasoned predictions about their further
development.</p>
      <p>Creating such corpora is an extremely complicated and labour-intensive activity,
and questions about the quality of the resulting corpora inevitably arise. While texts
recently published using computer-aided techniques are quite “pure”, old
books and periodicals must be scanned and then subjected to character recognition, which
inherently leads to errors in the corpus. In this paper, we have considered the case of
the currently largest diachronic corpus, GBN. It is shown that the main errors of the
earlier version have already been eliminated in the next version of the corpus. The
remaining minor errors are negligible in statistical computations on a large data
array. However, for Russian, some problems related to the old
spelling persist, and it would be reasonable to solve them.</p>
      <p>
        As to the most important issue, the balance/representativeness of GBN, a
plausible case has been made that the corpus is highly balanced. GBN
has been compared specifically with RNC and GICR, and the comparison has demonstrated
their high consistency. The latest versions of spelling correction systems may also be used
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>This research was financially supported by the Russian Foundation for Basic Research
(Grant No. 17-29-09163), the Government Program of Competitive Development of
Kazan Federal University, and through the State Assignment in the Area of Scientific
Activities for Kazan Federal University, agreement No 2.8303.2017/8.9.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>Russian National Corpus</source>
          . http://www.ruscorpora.ru/. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Russian National Corpus:
          <fpage>2003</fpage>
          -
          <lpage>2005</lpage>
          . Indrik, Moscow. (
          <year>2005</year>
          ).
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Russian National Corpus:
          <fpage>2006</fpage>
          -
          <lpage>2008</lpage>
          . V.A. Plungyan, ed.
          <source>Nestor-Istoriya. St. Petersburg</source>
          . (
          <year>2009</year>
          )
          <article-title>(in Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>J.-B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orwant</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brockman</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Petrov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Syntactic Annotations for the Google Books Ngram Corpus</article-title>
          .
          <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics</source>
          , vol.
          <volume>2</volume>
          : Demo Papers (ACL '12) (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.:
          <article-title>Quantitative Analysis of Culture Using Millions of Digitized Books</article-title>
          .
          <source>Science</source>
          <volume>331</volume>
          (
          <issue>6014</issue>
          ),
          <fpage>176</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Michel</surname>
          </string-name>
          , J.-B.:
          <article-title>Uncharted Big Data as a Lens on Human Culture</article-title>
          . Russian edition: Moscow. AST. 352 p. (
          <year>2016</year>
          )
          <article-title>(In Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Belikov</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>What and how can a linguist get from digitized texts?</article-title>
          In:
          <source>Sibirsky philologichesky zhurnal [Siberian Journal of Philology]</source>
          (
          <issue>3</issue>
          )
          ,
          <fpage>17</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2016</year>
          )
          <article-title>(In Russian)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Koplenig</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets - Reconstructing the composition of the German corpus in times of WWII</article-title>
          ,
          <source>Digital Scholarship in the Humanities</source>
          <volume>32</volume>
          ,
          <fpage>169</fpage>
          -
          <lpage>188</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1093/llc/fqv037
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pechenick</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>Ch.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Sh.</surname>
          </string-name>
          , and
          <string-name>
            <surname>Barrat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution</article-title>
          .
          <source>PLOS ONE</source>
          <volume>10</volume>
          (
          <issue>10</issue>
          ): e0137041 (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          :
          <article-title>Possible mechanisms of change in the cognitive structure of synonym sets</article-title>
          . In:
          <source>Language and Thought: Contemporary Cognitive Linguistics. A collection of articles</source>
          .
          <publisher-name>Languages of Slavic Culture</publisher-name>
          ,
          <publisher-loc>Moscow</publisher-loc>
          ,
          <fpage>478</fpage>
          -
          <lpage>487</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pechenick</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>Ch.M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P.Sh.</given-names>
          </string-name>
          :
          <article-title>Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not</article-title>
          .
          <source>J. Comput. Science</source>
          <volume>21</volume>
          ,
          <fpage>24</fpage>
          -
          <lpage>37</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havlin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanley</surname>
            ,
            <given-names>H.E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Perc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Languages Cool as They Expand: Allometric Scaling and the Decreasing Need for New Words</article-title>
          .
          <source>Scientific Reports</source>
          <volume>2</volume>
          , 943 (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havlin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stanley</surname>
            ,
            <given-names>H.E.</given-names>
          </string-name>
          :
          <article-title>Statistical laws governing fluctuations in word use from word birth to word death</article-title>
          .
          <source>Scientific Reports</source>
          <volume>2</volume>
          , 313 (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Bochkarev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wichmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universals versus historical contingencies in lexical evolution</article-title>
          .
          <source>J. R. Soc. Interface</source>
          <volume>11</volume>
          (
          <issue>101</issue>
          ) (
          <year>2014</year>
          ). https://doi.org/10.1098/rsif.2014.0841
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Acerbi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bentley</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>The expression of emotions in 20th century books</article-title>
          .
          <source>PLOS ONE</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ), e59030 (
          <year>2013</year>
          ). https://doi.org/10.1371/journal.pone.0059030 PMID: 23527080
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          :
          <article-title>From once upon a time to happily ever after: Tracking emotions in mail and books</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>53</volume>
          (
          <issue>4</issue>
          ),
          <fpage>730</fpage>
          -
          <lpage>741</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Morin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Acerbi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Birth of the cool: a two-centuries decline in emotional expression in Anglophone fiction</article-title>
          .
          <source>Cognition and Emotion</source>
          <volume>31</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1663</fpage>
          -
          <lpage>1675</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1080/02699931.2016.1260528 PMID: 27910735
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Scheff</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Toward defining basic emotions</article-title>
          .
          <source>Qualitative Inquiry</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <fpage>111</fpage>
          -
          <lpage>121</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiseman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jenkins</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Mental representations of weekdays</article-title>
          .
          <source>PLOS ONE</source>
          <volume>10</volume>
          (
          <issue>8</issue>
          ), e0134555 (
          <year>2015</year>
          ). https://doi.org/10.1371/journal.pone.0134555 PMID: 26288194.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Hills</surname>
            ,
            <given-names>T.T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Adelman</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          :
          <article-title>Recent evolution of learnability in American English from 1800 to 2000</article-title>
          .
          <source>Cognition</source>
          <volume>143</volume>
          ,
          <fpage>87</fpage>
          -
          <lpage>92</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.cognition.2015.06.009 PMID: 26117487.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Virues-Ortega</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pear</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>A history of “behavior” and “mind”: Use of behavioral and cognitive terms in the 20th century</article-title>
          .
          <source>The Psychological Record</source>
          <volume>65</volume>
          (
          <issue>1</issue>
          ),
          <fpage>23</fpage>
          -
          <lpage>30</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Greenfield</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          :
          <article-title>The changing psychology of culture from 1800 through 2000</article-title>
          .
          <source>Psychological Science</source>
          <volume>24</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1722</fpage>
          -
          <lpage>1731</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.1177/0956797613479387 PMID: 23925305
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Greenfield</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          :
          <article-title>Cultural evolution over the last 40 years in China: Using the Google Ngram Viewer to study implications of social and political change for cultural values</article-title>
          .
          <source>International Journal of Psychology</source>
          <volume>50</volume>
          (
          <issue>1</issue>
          ),
          <fpage>47</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1002/ijop.12125 PMID: 25611928
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Younes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Reips</surname>
            ,
            <given-names>U.-D.</given-names>
          </string-name>
          :
          <article-title>The changing psychology of culture in German-speaking countries: A Google Ngram study</article-title>
          .
          <source>International Journal of Psychology</source>
          <volume>53</volume>
          ,
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1002/ijop.12428 PMID: 28474338
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Velichkovsky</surname>
            ,
            <given-names>B.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bochkarev</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ishkineeva</surname>
            ,
            <given-names>F.F.</given-names>
          </string-name>
          :
          <article-title>Transition to market economy promotes individualistic values: Analysing changes in frequencies of Russian words from 1980 to 2008</article-title>
          .
          <source>International Journal of Psychology</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Del Giudice</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The twentieth century reversal of pink-blue gender coding: A scientific urban legend?</article-title>
          <source>Archives of Sexual Behavior</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1321</fpage>
          -
          <lpage>1323</lpage>
          (
          <year>2012</year>
          ). https://doi.org/10.1007/s10508-012-0002-z PMID: 22821170
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>How have males and females been described over the past two centuries? An analysis of Big-Five personality-related adjectives in the Google English Books</article-title>
          .
          <source>Journal of Research in Personality</source>
          <volume>76</volume>
          ,
          <fpage>6</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Bochkarev</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shevlyakova</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          :
          <article-title>The average word length dynamics as an indicator of cultural changes in society</article-title>
          .
          <source>Social Evolution &amp; History</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <fpage>153</fpage>
          -
          <lpage>175</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Younes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Reips</surname>
            ,
            <given-names>U.-D.</given-names>
          </string-name>
          :
          <article-title>Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms</article-title>
          .
          <source>PLOS ONE</source>
          <volume>14</volume>
          (
          <issue>3</issue>
          ), e0213554 (
          <year>2019</year>
          ). https://doi.org/10.1371/journal.pone.0213554
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Twenge</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campbell</surname>
            ,
            <given-names>W.K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gentile</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Changes in pronoun use in American books and the rise of individualism, 1960-2008</article-title>
          .
          <source>Journal of Cross-Cultural Psychology</source>
          <volume>44</volume>
          (
          <issue>3</issue>
          ),
          <fpage>406</fpage>
          -
          <lpage>415</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Malykh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lyalin</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Named Entity Recognition in Noisy Domains</article-title>
          .
          In:
          <source>The Proceedings of the 2018 International Conference on Artificial Intelligence: Applications and Innovations</source>
          (
          <year>2018</year>
          ). ISBN: 978-1-7281-0412-6.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Malykh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Khakhulin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Noise Robustness in Aspect Extraction Task</article-title>
          .
          In:
          <source>The Proceedings of the 2018 Ivannikov ISPRAS Open Conference</source>
          (
          <year>2018</year>
          ). ISBN: 978-1-7281-1275-6.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Anisimov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polyakov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makarova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Spelling correction in English: Joint use of bi-grams and chunking</article-title>
          .
          <source>2017 Intelligent Systems Conference, IEEE Xplore Digital library</source>
          (
          <year>2018</year>
          ). https://ieeexplore.ieee.org/document/8324234
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Koplenig</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A fully data-driven method to identify (correlated) changes in diachronic corpora</article-title>
          . https://arxiv.org/ftp/arxiv/papers/1508/1508.06374.pdf
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>