<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyiv National Linguistic University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents the General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org), which is publicly available and searchable online. It contains more than 400 million tokens and represents most genres of written texts. It also features regional annotation, i.e. about 50 percent of the texts are attributed to regions of Ukraine or to countries of the diaspora. If the author is known, the text is linked to their home region(s). Journalistic texts are annotated with the place where the edition is published. This feature distinguishes the GRAC from most general linguistic corpora.</p>
      </abstract>
      <kwd-group>
        <kwd>Ukrainian language</kwd>
        <kwd>corpus</kwd>
        <kwd>diachronic evolution</kwd>
        <kwd>regional variation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Many major national languages now have large universal corpora, known as
“reference” corpora (cf. das Deutsche Referenzkorpus, DeReKo) or “national”
corpora (a label that ultimately dates back to the British National Corpus). These corpora are
large, representative of different genres of written language, have a certain depth of
(usually morphological and metatextual) annotation and can be used for many different
linguistic purposes.</p>
      <p>Ukrainian still lacks a publicly available universal corpus of this kind, although
present-day linguistics clearly needs one. Researchers independently compile
corpora of Ukrainian for separate research purposes, with different sizes and
functionality. As the community lacks a universal tool, researchers may build their
own corpus according to their needs. For example, the UberText team [30] has built a
large Internet-based corpus which is published with shuffled sentences. The
published version fits the purposes of statistical analysis or the study of sentence-level
structures well, but it cannot be used for studies of text structure or cohesion in discourse.
The absence of a (reasonably) universal corpus for Ukrainian is still an issue.</p>
      <p>The paper presents the GRAC (uacorpus.org) [25], the General Regionally
Annotated Corpus of Ukrainian, which is intended to fill this gap. It is searchable online
and contains more than 400 million tokens, representing most genres of written texts.
About 50 percent of the texts are attributed to regions of
Ukraine or to countries of the diaspora. This feature distinguishes the GRAC from most
general linguistic corpora. It is motivated by the heavy regional and dialectal impact on
the development of the norm(s) of Modern Standard Ukrainian, at least until about
1950 and to a certain extent until now (a problem that is per se a subject for further
study).</p>
      <p>The paper is structured as follows. The second section discusses the existing
corpora of Ukrainian. The third section describes the contents and general
architecture of the GRAC. The fourth section covers the regional annotation. The
fifth section discusses the metatextual annotation with regard to genres, types of
texts, dates and sources. The sixth section is dedicated to the translations that are also
included in the corpus, whereas the seventh and eighth sections address
the morphological annotation and the orthographic regime, which are known
to be deeply interdependent. The ninth section presents the search engine and
query language. The tenth section briefly presents the results and future plans.</p>
    </sec>
    <sec id="sec-2">
      <title>Principal Corpora of Ukrainian: an Overview</title>
      <p>The only corpus of written Ukrainian where the dates of creation are specified for all
the texts and these texts are annotated with regard to their region and genre is the
General Regionally Annotated Corpus of Ukrainian (GRAC, http://uacorpus.org).
Using the corpus one can study linguistic phenomena synchronically and diachronically
according to the style and genre, as well as their statistical distribution with regard to
the regions.</p>
      <p>The corpus is developed by the present author in collaboration with Ruprecht von
Waldenfels (Germany), Serhii Yaryhin (Ukraine), Andrii Rysin (USA), Vasyl Starko
(Ukraine), Tymofii Nikolaienko (Ukraine), Mikhail Kruk (USA) and Michał Woźniak
(Poland).</p>
      <sec id="sec-2-1">
        <title>The Ukrainian Text Corpus (KTUM)</title>
        <p>
          This corpus has been developed since 2003 at the Laboratory of Computer Linguistics, Philological Institute
of the Taras Shevchenko Kyiv National University, under the direction of Natalia Darchuk
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It contains 100 million tokens, including legal texts (1.6 million), academic
texts (8.7 million), poetic texts (800 thousand), journalism (40 million) and
fiction (36 million).
        </p>
        <p>
          The corpus is available online [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The KTUM is the first publicly available
Ukrainian corpus, searchable online since 2010. One can search it for a word, word
form or grammatical features of a single word or of a two-term combination.
        </p>
        <p>The main disadvantage of the corpus is its morphological markup, which
requires the selection of a fully specified grammatical form, whereas separate
morphological features are not searchable (e.g. it is impossible to search for any part of speech
in the genitive unless a particular POS is selected). Texts are not annotated with
the date of their creation; only publication dates are provided. The
technical base of the corpus also needs updating.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Parallel Ukrainian-Russian and Russian-Ukrainian Corpora within the Russian National Corpus</title>
        <p>
          This corpus was compiled by Maria Shvedova within a project of the Institute of the
Russian Language, Russian Academy of Sciences, in 2009-2012. It contains 9.3 million
tokens of original texts and translations from 1774 (Russian-language texts by Grigory
Skovoroda) to 2011. The corpora are available for online search [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The
Ukrainian-Russian part contains 6.5 million tokens: fiction (431 texts), journalism (29 texts),
popular science (10 texts), legal texts (5), letters of Ukrainian writers (180 texts). The
Russian-Ukrainian part contains 2.8 million tokens: fiction (171 texts), journalism (6 texts),
popular science (6 texts), legal texts (6). The texts are taken from printed sources and
from the web. Textual pairs (original and translation) were aligned
sentence by sentence with the free program HunAlign [32]; the alignment was then manually corrected
using the Euclid program [26].
        </p>
        <p>The corpus is searchable by word, word form, grammatical feature or a set of
features. Strings of up to ten tokens are searchable. The search results are presented
as Russian-Ukrainian bilingual pairs with information about the source. Within the
concordance, contexts can be expanded to three sentences.</p>
      </sec>
      <sec id="sec-2-4">
        <title>The Corpus Project of the Laboratory of Ukrainian</title>
        <p>
          It is compiled by Natalia Kotsyba, Bohdan Moskalevskyi and Mykhailo Romanenko
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Within the project several corpora with a dedicated morphological analyzer are
being developed, viz. a treebank with manually resolved homonymy and manual tagging
(140 thousand tokens), the Zvidusil web corpus with automatic syntactic annotation
(about 3 billion tokens), as well as parallel corpora whose foreign-language parts have
the following sizes: Polish (4 million), English (1.5 million), French (0.5 million), German
(190 thousand), Spanish (65 thousand), Portuguese (16 thousand). Morphological
tagging in the parallel corpora is performed automatically according to the Universal
Dependencies system. The corpora use the NoSketchEngine platform and are available online
for searching through the Bonito and KonText interfaces.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>The Ukrainian Web Corpus of the Leipzig University (Germany)</title>
        <p>
          The corpus contains 1.5 billion tokens and is available for online search [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The corpus is
built by means of web crawling and contains only texts from the
Internet (mostly news) created before 2014. There is no morphological tagging; only word forms are
searchable. The corpus shows textual examples and collocations and plots graphs that
visualize the frequencies of word forms co-occurring in a sentence.
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>Corpus of Spoken Rusyn</title>
        <p>
          The Corpus of Spoken Rusyn was compiled by Achim Rabus and Ruprecht von
Waldenfels at Freiburg University in 2017 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The recordings for the corpus were
made in 2015 in Ukraine (Zakarpattia region), Slovakia, Poland and Hungary. The
recordings are manually annotated; each recorded fragment is accompanied by the
respective aligned transcript, which is annotated and searchable. Search is possible
only by word form; a regional subcorpus can be customized. The fragments of recordings
can be played and downloaded.
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>The Brown Corpus of Ukrainian</title>
        <p>This name is given to a corpus compiled by Vasyl Starko, Andrii Rysin et al., after the
well-known Brown corpus of English. It is a small balanced corpus (1 million
tokens) for building a statistical language model used in automatic language
processing. It is currently under development [27].</p>
      </sec>
      <sec id="sec-2-8">
        <title>The Ubertext Corpora</title>
        <p>These corpora of Ukrainian texts (news, Wikipedia, fiction, web texts) [30] are
developed by the Ubertext team and available for download with shuffled sentences for
copyright reasons.</p>
      </sec>
      <sec id="sec-2-9">
        <title>The Corpus of the Chtyvo Library</title>
        <p>
          It counts about 600 million words [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. It includes automatically recognized texts of
the books from the Chtyvo electronic library without postprocessing. Only
exact-match search is available; there is no lemmatization, morphological analysis or correction of
OCR mistakes.
        </p>
      </sec>
      <sec id="sec-2-10">
        <title>Summary</title>
        <p>The main properties of the corpora described in the present section are summarized in
Table 1.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Corpus</th><th>Size (tokens)</th></tr>
            </thead>
            <tbody>
              <tr><td>GRAC</td><td>437 million</td></tr>
              <tr><td>KTUM</td><td>100 million</td></tr>
              <tr><td>Parallel UkrRus</td><td>9.3 million</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          In this section we used some of the material cited in the paper [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Corpora that are
not available online are not included in our overview. In particular, we do not know
the actual size and functionality of the Ukrainian national linguistic corpus [31],
developed by the Ukrainian language-information foundation of the Ukrainian national
academy of sciences. It was published under restricted access and was not available to
us.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Contents and Architecture of the GRAC</title>
      <p>The GRAC is intended to encompass texts in (different forms of)
standard-oriented written Ukrainian from the first texts in Modern Ukrainian to the present day.
The corpus is designed to enable the study of diachronic, diatopic and normative
variation in the standard-oriented language. Examples of corpus-based studies are [24]
(on possessive pronouns) and [22] (on syntactic variation).</p>
      <p>
        The latest version (GRAC v.7.0) was created in December 2019. The chronological
period covered by the corpus spans from 1816 to 2019. The corpus contains more
than 430 million tokens in more than 45 thousand texts of different genres created by
about 15 thousand individually known authors. The texts are either scanned and
recognized from printed sources or taken (often with additional OCR and copyediting)
from the Internet; the sites are listed on the website of the corpus. The corpus represents
different genres, styles and regions. The date of
creation is specified for all texts, and many texts also have publication data. For the
first time a subcorpus of the Ukrainian diaspora has been created (such texts had been largely
neglected in Ukrainian corpus linguistics, whereas they are crucial for studying
normative variation, cf. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [29]).
      </p>
      <p>The corpus is currently designed to include prose, but poetic texts are planned
for its future expansion. As of early 2020, the Aeneid by
Kotliarevskyi, the first masterpiece of Modern Ukrainian, was the only poetic text
featured in the corpus. The corpus currently has no specific annotation for poetic
texts (metrical structure etc.).</p>
      <p>Translations are also included into the corpus as they have played a significant role
in the development of the standard language.</p>
      <p>About 50% of the texts belong to the domain of fiction.</p>
      <p>Among the non-fiction texts, a large subcorpus of journalism stands out. It includes
collections of newspapers from 1888-1893 (Dilo, Ruslan, Narod, Červona
Ukrajina, Bukovyna, Narodna Časopys’), 1905 (Xliborob), 1913-1918 (Dilo, Ruslan, Djilo
i Nove Slovo, Vil’na Ukrajina, Vistnyk Sojuza vyzvolennja Ukrajiny, Krakivs’ki visty,
L’vivs’ki visti, Vistnyk polityky, literatury i žyttja), 1919-1943 (Strilec’, Šljax do voli,
Visti VUCVK, Dilo, Meta, Novyj čas, Svoboda, Ukrajins’kyj Beskyd, Vil’na Ukrajina,
Červona Ukrajina, Krakivs’ki visty, Červonyj Peremyšl’, L’vivs’ki visti, Ukrajins’ki
ščodenni visty, Ukrajins’kyj Visnyk, Holos Pidkarpattja), contemporary newspapers
from different regions (Ukrajina moloda, Vysokyj zamok, Slovo, Visnyk SNAU,
Kryms’ka svitlycja, Naš den’, Čornomorec’, Nyva, Šalom Alejxem, Vinnyčyna,
Svoboda, Učytel’, Visnyk odes’koji advokatury, Visti Donbasu, Vpered, Krajeznavstvo
Zaporižžja, Licejist, Nasha hazeta [Novodnistrovs’k], Smiljanochka, Spivdružnist’,
21-j kanal, 7 dniv, Volyns’ki novyny, Vorskla et al.) as well as texts from news sites on
the web (as such sites sometimes use machine translation for translating news,
only sites that are published exclusively in Ukrainian were used as sources).
Another large subcorpus consists of academic and educational texts: monographs,
dissertations, scholarly papers, textbooks. There is a separate subcorpus of religious
texts, including, among others, two Ukrainian translations of the Bible. The subcorpus
of ego texts features memoirs, letters and diaries, as well as a considerable corpus
of Facebook posts representing blogs of people from all the Ukrainian regions and
from the diaspora. Also included are subcorpora of spoken genres, viz. speeches and
interviews.</p>
      <p>The GRAC also includes some dictionaries featuring phrasal examples and idioms,
among them the Dictionary of Ukrainian by B. Hrinchenko and the Russian-Ukrainian
Dictionary of Idioms by I. Vyrhan and M. Pylynska. Using the corpus tools, the
dictionaries can be searched not only by lexemes but also by the lexico-grammatical
patterns used in the examples and cited idioms.</p>
      <p>The majority of texts come from printed sources. There are also smaller subcorpora
of Internet texts (news, Facebook), visual media texts (translations and subtitles
for movies and TV shows) and family documents (correspondence and diaries).</p>
      <p>The family letters (about 800 texts) were collected by students of the Lviv
Polytechnical University advised by Professor Olena Levchenko. The texts are transcribed;
of all the texts included in the GRAC, they depart furthest from
the standard-oriented language. The original images of these letters are also available
within the corpus.</p>
    </sec>
    <sec id="sec-4">
      <title>Regional Annotation</title>
      <p>
        Worldwide, there exist different large corpora of regional and/or national linguistic
variations. Examples include corpora of New World Englishes (including the corpus
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], developed since 1988). The corpus of Global Web-based English [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has about 2
billion tokens. The regional and/or national linguistic variants are discussed e. g. in
[21] or in the more recent [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Corpus-based research on the lexicon and grammatical
categories of different regional varieties of English is conducted, e.g. on the Perfect gram
(in the volume [33]).
      </p>
      <p>
        A similar corpus system is built by Mark Davies for Spanish [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The corpus
includes 2 billion tokens from 21 Spanish-speaking countries and the United States.
      </p>
      <p>
        The Russian National Corpus includes a subcorpus of foreign press,
featuring a collection of Hrodna Region newspapers in Russian and Belarusian. There
is also corpus-based research on lexical and grammatical characteristics [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. A
corpus of post-1991 Russian texts from Ukraine has also been created and some pilot studies
have been performed [23].
      </p>
      <p>
        Russian linguists have studied the Russian language with regard to the regional
variety using the bulk of the Internet texts and the functionality of searching by regions
(especially in blogs) of the Yandex and Google search engines [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. These search
tools are not dedicated linguistic tools: they lack capacities for linguistic search and do
not yield the exact quantitative information necessary for statistical research. Nevertheless,
using the Internet to search for regionally marked linguistic data has evident
advantages, such as a large textual database, geographical and stylistic diversity, speed
and ease of search, and sometimes the presence of regional, chronological and authorship
information.
      </p>
      <p>
        These corpora and studies are designed for pluricentric languages, that is, languages
that function in different states where local linguistic norms and, ultimately, local
language varieties are formed. Modern Ukrainian is not pluricentric sensu stricto,
although it was formed historically in different centres and many local differences
(between the East, the West and the diaspora) are still present today [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Hence,
several approaches and methods developed in the study of pluricentric languages
can be applied to research on variation in Ukrainian [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The regional markup of the corpus is based on the contemporary administrative
structure of Ukraine. These boundaries are a pure convention that does not suggest any
correlation between the regional (standard-oriented) variation and the dialectal
boundaries. Texts from all the oblasts of Ukraine and from Crimea are represented in the
corpus. The regions are grouped into six macroregions. Nearly half of the texts in
the corpus have regional markup.</p>
      <p>A text belongs to the region where the author (or the translator, for a translated text)
was born, studied and/or lived for more than ten years. Media texts are assigned to
the region where the respective media outlet appeared. A single text can belong to several
regional subcorpora (if the author or the translator was born, studied or lived for a
long time in different regions). Alongside the regional subcorpora, there are
subcorpora of the diaspora (the United States, Canada, Poland, Germany, the United Kingdom,
France etc.).</p>
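      <p>The attribution rule described above can be sketched in a few lines of Python. This is only an illustration, not the GRAC's actual implementation: the Author record, its field names and the region labels are hypothetical, and only the rule itself (birth, study, residence of more than ten years, and place of publication for media texts) is taken from the description.</p>
      <preformat><![CDATA[
```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Author:
    """Hypothetical biographical record; the field names are illustrative only."""
    birth_region: str | None = None
    study_regions: list[str] = field(default_factory=list)
    # Maps a region to the number of years the author lived there.
    residence_years: dict[str, int] = field(default_factory=dict)


def author_regions(author: Author, min_years: int = 10) -> set[str]:
    """Regions where the author was born, studied, or lived for more than min_years."""
    regions = set(author.study_regions)
    if author.birth_region:
        regions.add(author.birth_region)
    regions.update(
        region for region, years in author.residence_years.items() if years > min_years
    )
    return regions


def text_regions(author: Author | None, media_region: str | None = None) -> set[str]:
    """Media texts take the place of publication; other texts follow the author."""
    if media_region is not None:
        return {media_region}
    return author_regions(author) if author is not None else set()
```
]]></preformat>
      <p>Because the result is a set, a single text can fall into several regional subcorpora at once, and a text with no known author or edition simply receives no regional markup.</p>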
    </sec>
    <sec id="sec-5">
      <title>Metatextual Annotation</title>
      <p>All the texts in the corpus are annotated with the year when they were written or with the
latest year in which the text could possibly have been created. Translated texts are marked with
the year when the translation was made. The date of creation is the main date of a
given text within the corpus. According to this date the text falls into the respective
chronological subcorpus and is counted in statistical research. Additionally, the date of
the edition used in the corpus can be specified.</p>
      <p>The corpus features information (if available) on the authors: year of birth, gender,
and the region(s) where the author was born, studied at a university and/or lived for more
than ten years.</p>
      <p>The corpus contains four types of media: newspapers, magazines, TV channels and
news sites. At the search page users can specify either the name of an edition or its
type.</p>
      <p>Each edition has information on the region where it appears (or used to appear).
Since information on the authors of media texts is often inaccessible, the regional
affiliation of these texts is tagged according to the place of publication rather than the
author.</p>
    </sec>
    <sec id="sec-6">
      <title>Translations</title>
      <p>Nearly a quarter of the corpus texts are translations. The corpus includes translations
from 69 languages, the most frequent source languages being English and Russian. It
is worth mentioning that Ukrainian translations are sometimes made from a
Russian version rather than from the language of the original text, and many editions do not
specify this fact. When it is known that Russian served as an intermediate language, this
is specified in the corpus. Nevertheless, not all the translated texts within the corpus
have been examined for the possible influence of an intermediate version.</p>
    </sec>
    <sec id="sec-7">
      <title>Morphological Annotation</title>
      <p>The GRAC morphology is based on the system of morphological analysis
developed by the specialists of the group r2u (Andriy Rysin, Vasyl Starko and others). The
system is based on the VESUM dictionary [28] and available for non-commercial
use.</p>
      <p>The program analyzes the text and for each token defines a lemma and tags
(grammatical markers), with ambiguity partially resolved by rule-based algorithms (see [28] for
further details). An analyzed word searched in the corpus has the following format:</p>
      <preformat>wordform /|lemma|/|tag1:tag2:tag3…|</preformat>
      <p>The phrase Він поспішав писати / Vin pospišav pysaty ‘He wrote in a hurry’ has the
following word-by-word annotation:</p>
      <preformat>Він /|він|/|noun:m:v_naz:&amp;pron:pers:3|
поспішав /|поспішати|/|verb:imperf:past:m|
писати /|писати|/|verb:imperf:inf|</preformat>
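      <p>For readers who want to process annotated lines in this format outside the search interface, a minimal parsing sketch in Python might look as follows. The Token type and the parse_token function are hypothetical names introduced for illustration; the sketch only assumes the line layout shown in the example (word form, lemma, colon-separated tags) and keeps each tag exactly as written.</p>
      <preformat><![CDATA[
```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    wordform: str
    lemma: str
    tags: List[str]


# Layout assumed from the example: wordform /|lemma|/|tag1:tag2:...|
_LINE = re.compile(r"^(?P<form>\S+)\s*/\|(?P<lemma>[^|]*)\|/\|(?P<tags>[^|]*)\|$")


def parse_token(line: str) -> Token:
    """Split one annotated line into word form, lemma and a list of tags."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not in the expected format: {line!r}")
    tags = [tag for tag in match.group("tags").split(":") if tag]
    return Token(match.group("form"), match.group("lemma"), tags)


sample = [
    "Він /|він|/|noun:m:v_naz:&pron:pers:3|",
    "поспішав /|поспішати|/|verb:imperf:past:m|",
    "писати /|писати|/|verb:imperf:inf|",
]
tokens = [parse_token(line) for line in sample]
```
]]></preformat>
      <p>Keeping the tags as a list mirrors the way the corpus lets a query target any single tag or a combination of them.</p>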
      <p>Thus it is possible to search by token, by lemma or by tag, and by different
combinations of these.</p>
      <p>The lemmas are marked only for the words present in the dictionary VESUM [28].
Other words can be searched only by token.</p>
      <p>The full list of tags is available at the site of the Ukrainian Brown Corpus project
[25]: https://github.com/brown-uk/dict_uk/blob/master/doc/tags.txt</p>
    </sec>
    <sec id="sec-8">
      <title>Orthography</title>
      <p>Adding old texts to the corpus raises certain problems,
including the “correction” of old texts in newer editions (not limited to orthography)
and the different orthographies of older editions. The majority of texts are included in
the corpus according to modern or Soviet editions. This is shown in the metadata of
the text (if known); while working with such texts one should keep in mind that they
could have been altered. When it is certain that the editors did interfere, the date of
the version is shown after the name of the text, while the main date of the text is still
the creation date, e.g.: Dmytro Buz’ko, Ljolja [version 2016-2018], 1924. A minority
of the texts dating back to the 19th century or to the beginning of the 20th are given in
the corpus according to the older editions, with the orthography preserved.</p>
      <p>The GRAC contains texts in Skrypnykivka and Zhelekhivka, and also some texts
in Yaryzhka (a Russian-based orthography), such as the oldest text in the corpus
(1816).</p>
      <p>The texts in Zhelekhivka are currently only partly morphologically analyzed. The
program correctly lemmatizes:</p>
      <list list-type="order">
        <list-item><p>Orthography of the type "називати ся" (with the reflexive particle written separately
only in immediate postposition, cf. in the modern orthography називатися /
nazyvatysja ‘to be called’)</p></list-item>
        <list-item><p>Orthography of the type "цїлком" (with ї after consonants, reflecting the Western
Ukrainian dialectal vocalism, cf. цілком / cilkom ‘as a whole’)</p></list-item>
        <list-item><p>Orthography of the type "мякий" (without an apostrophe, cf. м’який / m’jakyj
‘soft’)</p></list-item>
        <list-item><p>Orthography of the type "сьвіт" (with a soft sign marking the regressive palatalization,
cf. світ / svit ‘world’)</p></list-item>
      </list>
      <p>Other cases that do not correspond to the modern orthography, such as "моглиб"
(without a separated subjunctive particle, cf. могли б / mohly b ‘(plural) would be
able’), "жити меш" (with a separated future auxiliary, cf. житимеш / žytymeš
‘you will live’) and others, are not recognized by GRAC v.7; they do not have
lemmas and can be found only by exact search.</p>
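      <p>The four spelling types listed above can be illustrated with a small rule-based normalizer. The paper does not describe how the analyzer treats these spellings internally, so the following Python sketch is purely an assumption: the regular expressions are rough heuristics written for the cited examples (lowercase input only), and the names RULES and normalize are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# Consonant letters used by the heuristics; "р" and "й" are deliberately left out.
CONSONANTS = "бвгґджзклмнпстфхцчшщ"

# Each rule is (pattern, replacement). The rules are heuristics for the cited
# examples, not a full account of Zhelekhivka orthography.
RULES = [
    # Reflexive particle written separately: "називати ся" -> "називатися"
    (re.compile(r"(\w+)\s+ся\b"), r"\1ся"),
    # Word-initial soft sign before a labial: "сьвіт" -> "світ"
    (re.compile(r"\b([сзц])ь(?=[вм][ія])"), r"\1"),
    # "ї" after a consonant: "цїлком" -> "цілком"
    (re.compile(rf"(?<=[{CONSONANTS}р])ї"), "і"),
    # Missing apostrophe after a labial: "мякий" -> "м’який"
    (re.compile(rf"(?<![{CONSONANTS}])([бпвмф])(?=[яює])"), r"\1’"),
]


def normalize(text: str) -> str:
    """Apply the heuristic rules in order and return a modern-style spelling."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```
]]></preformat>
      <p>A normalizer of this kind would only serve as a lookup aid: the original spelling should stay in the corpus as the word form, with the normalized string used to find a lemma in the dictionary.</p>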
    </sec>
    <sec id="sec-9">
      <title>Search Query</title>
      <p>
        The GRAC search interface is based on the NoSketchEngine platform [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. It
supports search by lemma, word form and grammatical tags. Complex search queries
can be built using a CQL-based query language. A user can filter texts (only
texts of a given period, author, region, only translations from a given language etc.). It
is possible to customize and save personal subcorpora with any set of textual features
included in the annotation.
      </p>
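      <p>As an illustration of what a CQL query for lemma and tag search can look like, the following Python sketch assembles query strings. It rests on assumptions: the attribute names lemma and tag, the accusative tag v_zna, and the helper names token and within_doc are not taken from the GRAC documentation, and the document attribute passed to within_doc is a placeholder. Only the general CQL shape (one bracketed condition per token position, with regular expressions as values) is standard.</p>
      <preformat><![CDATA[
```python
def token(**attrs: str) -> str:
    """Build one CQL token position, e.g. [lemma="краяти"]."""
    if not attrs:
        return "[]"  # an empty position matches any token
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(parts) + "]"


def within_doc(query: str, attr: str, value: str) -> str:
    """Restrict a query to documents with one metadata value (assumed attribute name)."""
    return f'{query} within <doc {attr}="{value}" />'


# The verb "краяти" followed by a noun in the accusative (tag names are assumed).
query = token(lemma="краяти") + " " + token(tag="noun.*v_zna.*")

# The same query limited to a hypothetical document attribute and value.
filtered = within_doc(query, "region", "REGION_CODE")
```
]]></preformat>
      <p>In practice the same restriction can usually be set through the text filters of the interface, so the within clause is needed only when the whole search has to be expressed as a single query string.</p>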
      <p>In Table 2, the metatextual attributes and the bulk of tags are presented (for the
tags concerning source languages and regional markup, see the lists on the GRAC
site). For the search interface see Fig. 1.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Attribute</th><th>Tag</th><th>Values</th></tr>
          </thead>
          <tbody>
            <tr><td></td><td></td><td>ART: art; BIO: biology; CHE: chemistry; ECN: economics; ETH: ethnography; FMA: physics and mathematics; GEO: geography; HIS: history; IT: information technologies; JUR: law; MED: medicine; MIL: military; PED: pedagogy; PHL: literature and linguistics; PHS: philosophy; POL: political science; PSY: psychology; REZ: religion studies</td></tr>
            <tr><td>Source language</td><td>DOC.ORIGINAL</td><td></td></tr>
            <tr><td>Date</td><td>DOC.DATE</td><td></td></tr>
            <tr><td>Publication date</td><td></td><td></td></tr>
            <tr><td>Orthography</td><td></td><td>CONT: modern orthography; ZHEL: Zhelekhivka; SKRY: Skrypnykivka</td></tr>
            <tr><td>Author</td><td>DOC.AUTHOR</td><td></td></tr>
            <tr><td>Translator</td><td>DOC.TRANSLATOR; DOC.AUTHTRANS for searching both original and translated texts by the same writer</td><td></td></tr>
            <tr><td>Birth date of the author</td><td>DOC.BORN</td><td></td></tr>
            <tr><td>Gender</td><td>DOC.SEX</td><td>M: male; F: female</td></tr>
            <tr><td>Edition</td><td>DOC.MEDIANAME</td><td></td></tr>
            <tr><td>Media type</td><td>DOC.MEDIATYPE</td><td>MAGAZINE; NEWSPAPER; TV_CHANNEL; WEBSITE</td></tr>
            <tr><td>Region</td><td>DOC.LOCCODE; DOC.COUNTRY; DOC.MACROREGION; DOC.REGION</td><td></td></tr>
            <tr><td>Source type</td><td>DOC.SOURCE</td><td>PRI: printed source; WEB: Internet; FAM: family archive; TEL: TV</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-9-11">
        <p>The search results are displayed as a concordance (see Fig. 2). The display parameters
of the concordance can be additionally configured. The concordance includes
the contexts where the searched linguistic phenomenon is found (the contexts can, if
needed, be expanded by one more sentence to the left and to the right by clicking on the
key word) and information on the respective sources. Users can customize the set
of source information that is displayed in the concordance. The full
information on any text can be shown by clicking the row with metadata. The concordance
can be sorted by any attribute, for example the year of creation (DOC.DATE). The
concordance, configured in this way, may be downloaded as a table/database for
further processing.</p>
        <p>The search results can be used to automatically generate frequency lists by
different attributes (word form, lemma, tags, left and right context etc.); the frequency lists
are also available for download. Frequency dictionaries can be generated for any
subcorpus.</p>
        <p>The example in Fig. 3 shows the upper part of the frequency list for the
combination of the verb краяти / krajaty ‘cut’ with different nouns in the accusative (‘heart’,
‘bread’, ‘soul’, ‘air’, ‘ground’, ‘silence’, ‘meat’).</p>
        <p>The GRAC has an additional option for building frequency plots. Several types of
plots are supported (developed by Tymofij Nikolajenko).
 A frequency plot (according to any CQL query) that is build by instances per
million (ipm) with regard to years. The example in Fig. 4 shows that the frequency of
the variant в Украïнi / v Ukrajini ‘in Ukraine’, as it is perceived to describe a
country rather than a region, increases frequency after the proclamation of the
Ukrainian independence in 1991, whereas the на Украïнi / na Ukrajini variant
decreases.</p>
        <p>The data behind every trace (the ipm values by year) are also displayed as tables.</p>
        <p>• A plot of the ratio between the frequencies of two or more linguistic phenomena
by year. The illustration in Fig. 5 shows the distribution of the synonymous lexemes
слухавка / sluxavka and трубка / trubka ‘telephone handset’. The latter variant, which is
close to the Russian word, has been declining in frequency since the 1990s.</p>
        <p>All the plots may be filtered by subcorpus, including the regional subcorpora, so that
the diachronic distribution of two linguistic units can be compared across regions. Fig. 6
shows the percentages of the variants of the preposition ‘from’ in the central region of
Ukraine and the analogous distribution for the western region. In the “central” texts, the
variants од / od (above on both plots, brighter color) and від / vid (below on both plots,
darker color) compete until the 1930s, and від / vid takes over as the modern standard
from the 1940s on. In the “western” texts, од / od is very rare throughout; it gains some
momentum only in the 1970s, as an experiment influenced by the older “central” norm,
perceived as archaic.</p>
        <p>• Yet another type of plot shows the total number of words in all documents for a
particular year (the so-called 'norm').
        </p>
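        <p>The normalization behind the plot types described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the GRAC implementation: the function names ipm_by_year and share_by_year are hypothetical, and all counts are invented for the example. The ipm value for a year is taken to be the number of hits in that year divided by the 'norm' (the total number of words in the documents of that year) and multiplied by one million; the share of a variant is its percentage among all occurrences of the two competing variants in that year.</p>
        <preformat>
```python
from typing import Dict


def ipm_by_year(hits: Dict[int, int], norm: Dict[int, int]) -> Dict[int, float]:
    """Instances per million (ipm) for every year with a non-empty norm.

    hits: number of matches of a query per year (hypothetical input).
    norm: total number of words in all documents of that year.
    """
    result = {}
    for year, total_words in sorted(norm.items()):
        if total_words > 0:  # skip years without any text
            result[year] = hits.get(year, 0) * 1_000_000 / total_words
    return result


def share_by_year(hits_a: Dict[int, int], hits_b: Dict[int, int]) -> Dict[int, float]:
    """Percentage of variant A among all occurrences of A and B, per year."""
    result = {}
    for year in sorted(set(hits_a) | set(hits_b)):
        a = hits_a.get(year, 0)
        b = hits_b.get(year, 0)
        if a + b > 0:  # skip years in which neither variant occurs
            result[year] = 100.0 * a / (a + b)
    return result


# Invented counts, for illustration only (not GRAC data).
norm = {1985: 2_000_000, 1995: 4_000_000}
hits_v = {1985: 10, 1995: 120}    # e.g. hits for one variant
hits_na = {1985: 90, 1995: 40}    # e.g. hits for the competing variant

ipm_v = ipm_by_year(hits_v, norm)
share_v = share_by_year(hits_v, hits_na)
print(ipm_v)
print(share_v)
```
        </preformat>
        <p>Dividing by the yearly norm makes years with very different amounts of text comparable. In this sketch, years with an empty norm are skipped rather than plotted as zero.</p>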
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Conclusion</title>
      <p>A corpus with such detailed markup, combined with modern tools for searching and
processing results, is a new instrument in Ukrainian studies. It makes it possible to
raise new research questions concerning the history of modern written Ukrainian, its
regional variability, regional norms and standardization.</p>
      <p>Plans for further development include analyzing the current composition of the
corpus and identifying gaps in its representativeness, so that they can eventually be
filled. As the corpus was compiled primarily from texts that were easily available at the
time, some text types are underrepresented, for example Soviet texts of the
1950s–1980s, both political/propagandistic and academic. Such texts are seldom read or
digitized, although they reflect many tendencies in the language (not only imperial
Russification, as is often claimed) and served as sources for more recent developments.
The tools for visualizing search results are also to be improved.</p>
      <p>Special subcorpora and specific markup for them are to be further developed,
including poetic and parallel corpora and perhaps spoken texts.</p>
      <p>Further directions for the GRAC include greater genre diversity, more detailed
annotation (including semantic and possibly syntactic markup) and the integration of tools
for text processing.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgements</title>
      <p>We thank Orest Drul, Maksym Bystryckyi, Natalia Mikhailivska, Mykola Zharkykh,
and everybody who creates and develops online libraries; Olena Levchenko and the
Politekhnika students for preparing texts, especially 800 letters from family
archives; Vasyl Starko and the UCU students, who prepared for the GRAC a collection
of academic texts of 1995–2016 from the electronic library of the National Academy;
Solomiia Buk and the students of the Ivan Franko National University of Lviv, and Olena
Dotsenko and the students of the Borys Hrinchenko Kyiv University, for preparing texts; the
Tempora publishing house for the interwar Ukrainian texts; Olena Yavorska and the
State Literary Museum (Odesa); Mykhailo Nazarenko, Artem Fedorinchyk, Taras
Shmiger, Olha Sakharova, Oksana Taran, Volodymyr Hutorov and other partners who
provided texts for the corpus.
      </p>
      <p>21. Schmied, J.: Corpus linguistics and non-native varieties of English. World Englishes 9 (3), 255–268 (1990).</p>
      <p>22. Švedova, M. O.: Dynamika syntaksyčnyx konstrukcij na poznačennja šljaxu pry dijeslovax ruxu: korpusne doslidžennja. Ukrajinsʹka mova 3, 67–79 (2018). [Shvedova, M. O.: Dynamics of syntactic constructions for marking path with motion verbs: a corpus-based study. Ukrainian Language 3, 67–79 (2018).]</p>
      <p>23. Švedova, M. O.: Korpusni metody doslidžennja rehionalʹnyx vidminnostej u mežax odnijeji movy (na materiali rehionalʹnyx korpusiv ukrajinsʹkoji ta rosijsʹkoji mov). Visnyk Xarkivsʹkoho nacionalʹnoho universytetu im. V. N. Karazina. Serija: Filolohija 77, 33–38 (2017). [Shvedova, M. O.: Corpus studies of regional differences within a language (based on the material of regional corpora of Ukrainian and Russian languages). Bulletin of V. N. Karazin Kharkiv National University. Series: Philology 77, 33–38 (2017).]</p>
      <p>24. Švedova, M. O.: Stanovlennja prysvijnoho zajmennyka tretʹoji osoby množyny v ukrajinsʹkij literaturnij movi. Movoznavstvo 4, 40–53 (2018). [Shvedova, M. O.: Development of the possessive third person plural in Modern Ukrainian. Movoznavstvo/Linguistics 4, 40–53 (2018).]</p>
      <p>25. Shvedova, M., von Waldenfels, R., Yarygin, S., Kruk, M., Rysin, A., Starko, V., Woźniak, M.: GRAC: General Regionally Annotated Corpus of Ukrainian, uacorpus.org, last accessed 2020/04/12.</p>
      <p>26. Sičinava, D. V.: Parallel’nye teksty v sostave Nacional’nogo korpusa russkogo jazyka: novye napravlenija razvitija i rezultaty. Trudy Instituta russkogo jazyka RAN 6, 194–235 (2015). [Sitchinava, D. V.: Parallel texts within the Russian National Corpus: new directions and results. Proceedings of the Russian Language Institute 6, 194–235 (2015).]</p>
      <p>27. Starko, V.: Building of the Brown Ukrainian Corpus. Movni i kontseptual'ni kartyny svitu [Linguistic and Conceptual Weltbilder] 48, 415–421 (2014).</p>
      <p>28. Starko, V., Rysin, A.: Velykyj elektronnyj slovnyk ukrajinsʹkoji movy (VESUM) jak zasib NLP dlja ukrajinsʹkoji movy (u druci). [Starko Vasyl (Lutsk, Ukraine), Rysin Andrew (Cary, USA): The Great Electronic Dictionary of the Ukrainian Language (VESUM) as a NLP Tool for the Ukrainian Language (forthcoming).]</p>
      <p>29. Taranenko, O. O.: Mova ukrajinsʹkoji zaxidnoji diaspory i sučasna movna sytuacija v Ukrajini (na zahalʹnoslov‘jansʹkomu tli). Movoznavstvo 2–3, 63–99 (2013). [Taranenko, O. O.: The Language of the Ukrainian Western Diaspora and the Current Linguistic Situation in Ukraine (against the Slavic Background). Movoznavstvo/Linguistics 2–3, 63–99 (2013).]</p>
      <p>30. UberText, https://lang.org.ua/uk/corpora/, last accessed 2020/04/12.</p>
      <p>31. Ukrainian national linguistic corpus, http://unlc.icybcluster.org.ua/virt_unlc/, last accessed 2020/04/12.</p>
      <p>32. Varga, D., Németh, L., Halácsy, P., Kornai, A., Trón, V., Nagy, V.: Parallel corpora for medium density languages. In: G. Angelova et al. (eds.) Proceedings of the RANLP 2005, pp. 590–596, INCOMA, Shoumen (Bulgaria) (2005).</p>
      <p>33. Werner, V., Seoane, E. (eds.): Re-Assessing the Present Perfect: Corpus Studies and Beyond. Mouton de Gruyter, Berlin (2016).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ažnjuk</surname>
            ,
            <given-names>B. M.</given-names>
          </string-name>
          :
          <article-title>Movna jednistʹ natsiji: diaspora i Ukrajina. Ridna mova</article-title>
          ,
          <source>Kyjiv</source>
          (
          <year>1999</year>
          ). [
          <string-name>
            <surname>Azhniuk</surname>
            ,
            <given-names>B. M.:</given-names>
          </string-name>
          <article-title>The Language Unity of the Nation: Diaspora and Ukraine</article-title>
          . Ridna mova,
          <source>Kyiv</source>
          (
          <year>1999</year>
          )]
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Brown corpus of Ukrainian, https://github.com/brown-uk/corpus, last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Corpora Collection of the Leipzig University, http://corpora.informatik.uni-leipzig.de/de?corpusId=ukr_mixed_2014, last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <source>Corpus del Español: Web/Dialects</source>
          , https://www.corpusdelespanol.org/web-dial/, last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <source>Corpus of Global Web-based English (GloWBE)</source>
          , https://www.english-corpora.org/glowbe/, last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Corpus of Spoken Rusyn, http://parasolcorpus.org/Varchola1/login.php,
          last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <article-title>Corpus of the Chtyvo library</article-title>
          , http://korpus.org.ua/,
          last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Danylenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>How Many Varieties of Standard Ukrainian Does One Need</article-title>
          ?
          <source>Die Welt der Slaven</source>
          <volume>LX</volume>
          ,
          <fpage>223</fpage>
          -
          <lpage>247</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Darčuk</surname>
            ,
            <given-names>N. P.:</given-names>
          </string-name>
          <article-title>Doslidnycʹkyj korpus ukrajinsʹkoji movy: osnovni zasady i perspektyvy. Visnyk KNU im</article-title>
          .
          <source>Tarasa Ševčenka. Serija: Literaturoznavstvo. Movoznavstvo. Folʹklorystyka 21</source>
          ,
          <fpage>45</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2010</year>
          ). [Darchuk,
          <string-name>
            <surname>N. P.</surname>
          </string-name>
          :
          <article-title>Research corpus of the Ukrainian language: basic principles and perspectives</article-title>
          . Bulletin of Kyiv Shevchenko University. Series: Literary Studies.
          <source>Linguistics. Folklore</source>
          <volume>21</volume>
          ,
          <fpage>45</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2010</year>
          )]
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fokin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Korpusy tekstiv: zdobutky Ukrajiny ta perspektyvy vraxuvannya zakordonnoho dosvidu</article-title>
          .
          <source>Visnyk KNU im. Tarasa Ševčenka. Serija: Literaturoznavstvo. Movoznavstvo. Folʹklorystyka 28</source>
          ,
          <fpage>51</fpage>
          -
          <lpage>54</lpage>
          (
          <year>2018</year>
          )
          <article-title>[Fokin S. Corpus of texts: achievements of Ukraine and prospects of taking into account foreign experience</article-title>
          // Bulletin of Taras Shevchenko National University of Kyiv.
          <source>Series: Literary Studies. Linguistics</source>
          <volume>28</volume>
          ,
          <fpage>51</fpage>
          -
          <lpage>54</lpage>
          (
          <year>2018</year>
          )]
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gritsenko</surname>
            ,
            <given-names>P. Je.</given-names>
          </string-name>
          :
          <article-title>Nekotoryje zamečaniya o dialektnoj osnove ukrainskogo literaturnogo jazyka</article-title>
          . In: Toporov, V. N. (ed.)
          <source>Philologia slavica: K 70-letiju akademika N. I. Tolstogo</source>
          , S.
          <fpage>284</fpage>
          -
          <lpage>294</lpage>
          . Nauka, Moskva (
          <year>1993</year>
          ). [Gritsenko, P. E.:
          <article-title>Some remarks on the dialect basis of the Ukrainian literary language</article-title>
          . In: Toporov, V. N. (ed.)
          <source>Philologia slavica: Papers presented to Nikita Tolstoy on his 70th anniversary</source>
          . Nauka, Moscow, pp.
          <fpage>284</fpage>
          -
          <lpage>294</lpage>
          . (
          <year>1993</year>
          )]
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. International Corpus of English. http://ice-corpora.net/ice/index.htm,
          last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. KTUM Corpus of Ukrainian texts, http://www.mova.info/corpus.aspx,
          last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Laboratorija Ukrajins'koji, https://mova.institute, last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mair</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>World Englishes and Corpora</article-title>
          . In: Filppula, M., Klemola, J., Sharma, D. (eds.)
          <source>The Oxford Handbook of World Englishes</source>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>122</lpage>
          . Oxford University Press, Oxford (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Matvijas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Dialektna osnova ukrajinsʹkoji literaturnoji movy</article-title>
          .
          <source>Movoznavstvo</source>
          <volume>6</volume>
          ,
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2007</year>
          ). [
          <string-name>
            <surname>Matviyas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The dialectal basis of the Ukrainian literary language</article-title>
          .
          <source>Movoznavstvo/Linguistics. 6</source>
          ,
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2007</year>
          )]
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <source>Russian-Ukrainian and Ukrainian-Russian parallel subcorpora of the RNC</source>
          , http://www.ruscorpora.ru/search-para-uk.html, last accessed
          <year>2020</year>
          /04/12.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <article-title>Russkij jazyk i novyje texnologii. Novoje literaturnoje obozrenije</article-title>
          ,
          <source>Moskva</source>
          (
          <year>2014</year>
          ).
          <article-title>[Russian language and new technologies</article-title>
          . New Literary Review, Moscow (
          <year>2014</year>
          ).]
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rychlý</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Manatee/Bonito - A Modular Corpus Manager</article-title>
          . In: Sojka, P., Horák, A. (eds.)
          <source>First Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2007</source>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          . Masaryk University, Brno (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Savčuk</surname>
            ,
            <given-names>S. O.</given-names>
          </string-name>
          :
          <article-title>Korpus kak instrument dlja issledovanija osobennostej funkcionirovanija russkogo jazyka v regional'noj presse</article-title>
          .
          <source>Trudy Instituta russkogo jazyka RAN 6</source>
          ,
          <fpage>333</fpage>
          -
          <lpage>365</lpage>
          (
          <year>2015</year>
          ). [
          <string-name>
            <surname>Savchuk</surname>
            <given-names>S. O.</given-names>
          </string-name>
          <article-title>Corpus as a tool for studying the features of the functioning of the Russian language in the regional press</article-title>
          .
          <source>Proceedings of the Russian Language Institute</source>
          <volume>6</volume>
          ,
          <fpage>333</fpage>
          -
          <lpage>365</lpage>
          (
          <year>2015</year>
          )]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>