=Paper= {{Paper |id=Vol-1607/arckhangelskiy |storemode=property |title=Developing Morphologically Annotated Corpora for Minority Languages of Russia |pdfUrl=https://ceur-ws.org/Vol-1607/arckhangelskiy.pdf |volume=Vol-1607 |authors=Timofey Arkhangelskiy,Maria Medvedeva |dblpUrl=https://dblp.org/rec/conf/clif/ArkhangelskiyM16 }} ==Developing Morphologically Annotated Corpora for Minority Languages of Russia== https://ceur-ws.org/Vol-1607/arckhangelskiy.pdf
Developing morphologically annotated corpora for minority languages of Russia

Timofey Arkhangelskiy
National Research University Higher School of Economics
tarkhangelskiy@hse.ru

Maria Medvedeva
University of Groningen
medvmr@gmail.com

Abstract

Despite recent progress in developing annotated corpora for minority languages of Russia, still only about a dozen out of about 100 have comprehensive corpora, and even fewer have computational tools such as machine translation systems or speech recognition modules. However, given that many of them have resources such as dictionaries and grammars, the situation can be improved at relatively low cost. In this paper we demonstrate a pipeline that can be used for developing such corpora, using the development of the Udmurt and Adyghe corpora as examples. The methods we describe are in principle applicable to any language for which certain kinds of linguistic resources are available.

1. Introduction

Language corpora are one of the primary instruments of research in contemporary linguistics. Corpora allow researchers from all over the world to analyze raw language data rather than its interpretations by other linguists in grammars and articles. Compiling publicly available corpora is particularly important for more ‘remote’ languages to which most researchers have restricted physical access. But, precisely because of their remoteness and poorer accessibility, there are no corpora for most such languages, while there is a multitude of quality corpora for most European languages.

In this paper we discuss developing corpora for minority languages of Russia. Despite their genetic diversity, these languages are similar in several respects, which makes certain approaches applicable to all or most of them. We will focus primarily on the cases of Udmurt and Adyghe and show that the solutions we used in their development can be employed for creating corpora of other languages of Russia at a reasonably low cost. The Udmurt corpus was first released in 2014 and is available at http://web-corpora.net/UdmurtCorpus/search . The Adyghe corpus is currently under development. The pilot version of the corpus currently has restricted access, but it is expected to be released later in 2016 at the same portal.

2. Languages of Russia and their corpora

There are 93 living indigenous minority languages spoken in Russia, according to Ethnologue (Lewis et al., 2016; the number should not be seen as precise because of the language vs. dialect uncertainty). All or almost all of them share several features important for corpus linguistics.

First, the vast majority of them are written and have an official orthography, which, with the exception of a
handful of Finnic languages, is based on the Cyrillic alphabet. As virtually all of these orthographies were developed in the 1930s or later, they represent the phonology in a fairly straightforward fashion, unlike English, Russian or other languages with a long written tradition. Having been developed by professional linguists, these orthographies faithfully reflect all phonological distinctions. On the level of lexicon, these languages share numerous loanwords from Russian. On the level of grammar, all these languages are morphologically rich, having on average more morphologically expressed grammatical categories than Standard European languages. This implies that, in order to be useful for a wide range of linguistic research, their corpora should have full morphological tagging including all morphological categories, rather than mere POS-tagging. Fine-grained morphological annotation is also essential for developing higher levels of annotation, such as syntactic parsing or anaphora resolution, in morphologically rich languages (see e.g. Goldberg and Elhadad, 2013 on syntactic parsing of Hebrew).

More importantly, however, quality linguistic resources have been created for these languages. Virtually all of them have grammars and many have extensive bilingual (usually X-to-Russian) dictionaries. These resources, as we will show, can be transformed into taggers relatively easily, and thus are crucial for low-cost corpus development.

Existing corpora of minority languages of Russia can be split into two groups: relatively small (almost always under 100,000 tokens) manually annotated collections, mainly containing spoken texts, and larger ones (at least several hundred thousand tokens, and usually more than one million) with automatic annotation. Numerous corpora of the former kind have been collected for various languages and dialects in linguistic expeditions since the 1960s. However, their size, which is naturally constrained by the amount of time and money required for their collection, is too small for many kinds of research, especially if the research involves statistics. In this paper, we focus on larger, automatically annotated (and mostly written) corpora, which are more suitable for low-cost development. To the best of our knowledge, such large corpora have been released for the following 13 minority languages of Russia belonging to five language families:

 • East Caucasian: Avar (http://web-corpora.net/AvarCorpus/search/), Dargwa (http://dag-languages.org/DargwaCorpus/search/), Lezgian (http://dag-languages.org/LezgianCorpus/search/);
 • Indo-European: Ossetic (Iron dialect: http://corpus.ossetic-studies.org/search/, Digor dialect: http://corpus-digor.ossetic-studies.org/search/), Romani (http://web-corpora.net/RomaniCorpus/search/);
 • Mongolic: Buryat (http://web-corpora.net/BuryatCorpus/search/; Badmaeva 2015), Kalmyk (http://web-corpora.net/KalmykCorpus/search/);
 • Turkic: Tatar (http://web-corpora.net/TatarCorpus/search/; Suleymanov et al. 2011), Bashkir (Buskunbaeva, Sirazitdinov 2011), Khakas (http://khakas.altaica.ru/texts/; Sheymovich 2011);
 • Uralic: Udmurt, Mari (http://corpus.mari-language.com/; Bradley 2015), Komi (http://komicorpora.ru/).

Apart from those, there are several ongoing corpus development projects that we know of, including the Adyghe corpus project. There are also reports on developing corpora for Chuvash (Zheltov 2015), Tuva (Salchak, Bayirool 2013) and Yakut (Leontyev 2014), but the status of these projects is unclear.

3. The pipeline of corpus development

3.1. Collecting the texts

Books and other printed materials exist for most of the languages in question, but the cost of scanning, OCR and proofreading a sufficient amount of text is prohibitive for a low-budget corpus project. The only way to obtain a sufficiently large text collection at low cost is therefore the Internet. Unfortunately, this constraint makes it impossible to build corpora for the small and critically endangered languages that have very low digital vitality, in the terms of Kornai (2013). However, it seems that more than one third of the languages in question are to some extent represented on the web. According to the estimates of Zaydelman et al. (2016), 30 to 40 languages of Russia have a visible amount of text on the Internet. The overall size of available texts
varies between a couple of thousand and several dozen million tokens. Our corpus of Udmurt, the 13th largest minority language and probably the most digitally well-represented Uralic minority language, currently contains 7.3 million tokens, which covers the vast majority of all digitally available texts for this language. The volume of available data is growing slowly but steadily: according to the year distribution of our texts, the growth rate was on average 0.7 million tokens per year in 2011-2015.

The texts available on the Internet fall mainly into one of the following groups: digital newspapers/mass media, blogs/social media and Wikipedia articles. For the genre composition of the Udmurt corpus see Table 1. It can be seen that the corpus is severely unbalanced, as the genre distribution is skewed in favor of press, followed by blogs with less than 6%. Our survey of texts in other minority languages available online suggests that the distribution is roughly the same for all these languages (with the possible exceptions of Tatar and Bashkir). Lack of balance, which is inevitable in the proposed method of corpus development, is one of its largest downsides.

  Genre                 Tokens (millions)   %
  press                 6.64                90.56%
  blogs                 0.42                5.71%
  New Testament         0.13                1.73%
  Wikipedia articles    0.06                0.84%
  non-fiction           0.03                0.40%
  poetry                0.03                0.40%
  fiction               0.02                0.36%
  Total                 7.33

  Table 1: Udmurt corpus genre composition

Currently, the corpus does not include Udmurt posts from vkontakte, the most popular social network in Russia, which are estimated to contain more than 0.5 million tokens.

The resulting text collection resembles the corpora developed within the ‘Web as corpus’ approach (Kilgarriff and Grefenstette, 2003). There is, however, an important difference between ‘web as corpus’ and the approach presented here. While the former aims at gathering vast amounts of data for NLP purposes, the objective of the latter is to collect all available texts in a given language, as the size of the collection is limited for minority languages. According to our estimate, there are fewer than 100 web domains containing texts in Udmurt. This order of magnitude allows for manual inspection of all relevant web domains (probably with the exception of Tatar and Bashkir, the most digitally viable of all these languages) and does not require extraordinary computational resources to process them.

Another potential pitfall in this process, besides poor balance, is the low quality of texts on Wikipedia. While for larger languages Wikipedia is often used as a convenient and reliable source of linguistic data, Wikipedias in minority languages of Russia often contain a substantial amount of automatically generated and thus linguistically useless content, which can be easily seen in their distorted frequency lists (Orekhov and Reshetnikov, 2014). If Wikipedia articles are to be used at all, they should be filtered (e.g. by length), after which normally only a small number of articles make it into the corpus. The corresponding figure in Table 1 shows the size of the Wikipedia subcorpus after filtering.

3.2. Morphological tagging

Given that the texts can be collected from the Internet and that tokenization is not much of a problem for minority languages of Russia, development of a morphological tagger is the most difficult step in corpus development. Both statistical and dictionary-based taggers require a substantial amount of manual labor if built from scratch. The former have to be trained on sufficiently large manually annotated collections, while the latter require that a grammatical dictionary be compiled manually. However, the bilingual dictionaries available for the languages of Russia make compilation of a grammatical dictionary a much easier task. This fact, as well as the tradition of grammatical description of Russian that was started by Zaliznyak (1977), is the reason why all corpora listed in section 2 use the dictionary-based approach.

The idea is to manually write a formalized description of the morphology based on the grammars, and then transform a bilingual dictionary into a grammatical dictionary. In the Udmurt and
Adyghe projects we used the UniParser format and software for formalized description and tagging, which were also used for most other aforementioned corpora (Arkhangelskiy et al., 2012). There are also plenty of alternatives, including PC-KIMMO (Antworth, 1992), used in the Tatar corpus tagger, or the giellatekno infrastructure (Moshagen et al., 2013).

The central problem in this step is the fact that bilingual dictionaries normally do not contain necessary grammatical information such as part of speech or declension / vowel harmony type; this information has to be restored automatically. We combined three approaches to address this issue.

First, the form of the lemma in some cases clearly indicates its part of speech. In Udmurt, we tagged as verbs all lemmata ending in -ɨnɨ or -anɨ (markers of the infinitive). A manual check found that only one word, ǯɨnɨ ‘half’, was tagged incorrectly during this step.

All other parts of speech, however, did not have any markers that could be used as clues for part-of-speech tagging. In Adyghe, a polysynthetic language where bare stems are used as citation forms and parts of speech in general are not well differentiated, this was impossible altogether. The second approach, which we used for these cases, was to take the tag given by a Russian tagger (specifically, mystem (Segalovich, 2003)) to the first non-abbreviated word of the translation. This worked surprisingly well: in Udmurt, around 85% of these tags proved to be correct. The wrong tags came primarily from two sources. First, some of the translation equivalents in both Udmurt and Adyghe dictionaries had several possible analyses, e.g. adjectives which are commonly used as (substantivized) nouns. Second, Udmurt has an extensive (hundreds of items) inventory of ideophones, or imitative words that do not have Russian translation equivalents and are translated periphrastically, e.g. čʼɨš-čʼaš ‘about burning of wet wood’.

As the final approach we wrote some simple scripts. In Udmurt, the only additional field needed for tagging beyond part of speech is the conjugation type, which is determined by the last vowel of the stem (Winkler 2001: 45). In Adyghe, there is a regular e/a alternation in stems of a certain kind (Arkadyev and Testelets, 2009). Whenever the script sees an alternated stem in the lemma, it generates the base form and adds it to the list of stem allomorphs in the grammatical dictionary.

Apart from the challenge posed by part of speech tags, the excess of information in the dictionary can be an obstacle. One of its manifestations is the abundance of synonymous translation equivalents, usage examples and phrases in dictionary entries, which are usually not needed in the corpus and thus have to be cut out. In Adyghe, for which several dictionaries were used as input, this led to especially long translations, since different dictionaries used different synonyms for translating the same word. This issue was addressed by passing the translation equivalent through a number of transformations. All secondary meanings and comments were removed by cutting out segments in parentheses, after semicolons and after colons if a certain threshold length had been reached. In the case of Adyghe, the synonyms in translation equivalents were rearranged in decreasing order of frequency (according to the data from the Russian National Corpus), so that the most frequent synonym appeared first and all the rest could be easily deleted during manual proofreading.

Another manifestation of this problem lies in the list of words included in the dictionary. Apart from too many (potential) Russian loanwords that will probably never appear in a corpus, such as aerosyomka ‘aerial photography’, dictionaries for minority languages of Russia tend to include absolutely compositional and productive derivatives or word forms as separate entries. For both Udmurt and Adyghe, this involves, first and foremost, verbal derivation. In Udmurt, causative, detransitive and iterative forms of most verbs were included in the dictionary, although only a handful of them have a somewhat non-compositional meaning. If left as is, the tagger based on such a dictionary would give seemingly ambiguous results for words containing these affixes. For example, the word vera-lʼlʼa-z (speak-ITER-PST.3SG) would be ambiguously tagged as the ITER.PST.3SG form of the verb veranɨ ‘speak’ and as the PST.3SG form of the verb veralʼlʼanɨ ‘speak (repeatedly)’. These words were removed from the dictionary with a script that searched for a marker of one of these categories and checked if the remaining part was listed in the dictionary as a separate verbal stem. The situation is somewhat more difficult in Adyghe. Adyghe is a polysynthetic language, which means that the stem can attach numerous derivational affixes. While
most of these combinations are perfectly compositional, some are not, so a manual check of all such complex stems is required. Words involving non-compositional combinations of stems and derivational affixes in the Adyghe corpus get two levels of annotation, one for the original stem, the other for the combination of the stem and the affix (Arkhangelskiy and Lander, 2016). The interface enables users to search either for all occurrences of a given stem, or only for those occurrences where it is not part of a non-compositional combination.

Finally, the dictionaries have to be extended manually by adding irregular words (mostly pronouns) and frequent regular words that were absent due to gaps in the source dictionary or conversion errors. The Udmurt tagger, to which all pronouns and no more than a hundred other frequent lexemes were added manually, currently covers about 88% of the tokens in the corpus. Here is an example of an entry from the resulting Udmurt dictionary:

-lexeme
 lex: кизьыны
 stem: киз.
 gramm: V,I
 paradigm: connect_verbs-I-soft
 trans_ru: сеять, посеять, засеять

The entry contains fields indicating its lemma, stem, grammar tags, set of inflectional affixes and Russian translation.

4. Conclusion

The presented pipeline, which we used for developing the Udmurt corpus and which is currently used in the Adyghe corpus project, allows for relatively inexpensive construction of digital corpora. The proposed approach is applicable to digitally represented languages which have grammars and dictionaries. According to our estimates, there are still 15 to 20 minority languages of Russia that lack comprehensive written corpora but have enough resources for this approach to be applied to them.

The resulting corpora will only have morphological annotation and will probably be severely unbalanced. However, development of such corpora constitutes a necessary step for introducing higher levels of annotation and for achieving better balance. Our ongoing experiments with OCRed Udmurt books suggest that adding a simple ngram-based postprocessor trained on corpus data may significantly improve OCR quality, reduce the cost of proofreading and thus eventually lead to adding books to the corpus. Finally, language models trained on such corpora enable other NLP tools for these under-resourced languages (as an example, Yandex launched an Udmurt-Russian machine translation service in 2016, which uses a language model trained on the Udmurt corpus). This, in turn, can contribute to the preservation and revitalization of the minority languages.

References

Antworth, E. L. 1992. Glossing text with the PC-KIMMO morphological parser. Computers and the Humanities, 26(5-6):389-398.

Arkadyev, P. and Testelets, Ya. 2009. O trekh cheredovaniyakh v adygejskom yazyke [On three alternations in the Adyghe language]. Ya. Testelets (ed.), Aspekty polisintetizma [Aspects of polysynthesis]. 121-145.

Arkhangelskiy, T., Belyaev, O. and Vydrin, A. 2012. The creation of large-scaled annotated corpora of minority languages using UniParser and the EANC platform. Proceedings of COLING 2012: Posters, Ch. 9: 83-91.

Arkhangelskiy, T. and Lander, Yu. 2016. Developing a polysynthetic language corpus: problems and solutions. Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue”.

Badmaeva, L. 2015. Natsionalnyj korpus buryatskogo yazyka: predposylki i perspektivnye puti razrabotki [Buryat National Corpus: prerequisites and future development trajectories]. Vestnik Buryatskogo gosudarstvennogo universiteta, 72.

Bradley, J. 2015. corpus.mari-language.com: A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns. Septentrio Conference Series, 2:57-68.

Buskunbaeva, L. and Sirazitdinov, Z. 2011. Sistema razmetok v natsionalnom korpuse bashkirskogo yazyka [Annotation system in Bashkir National Corpus]. Proceedings of “Yazyki menshinstv v kompyuternykh tekhnologiyakh: opyt, zadachi i perspektivy”, Yoshkar-Ola: 46-51.

Goldberg, Y. and Elhadad, M. 2013. Word Segmentation, Unknown-word Resolution, and Morphological Agreement in a Hebrew Parsing System. Computational Linguistics, 39/1:121-160.

Kilgarriff, A., and Grefenstette, G. 2003. Introduction to the special issue on the web as corpus. Computational linguistics, 29(3):333-347.

Kornai, A. 2013. Digital Language Death. PLoS ONE 8(10): e77056. doi:10.1371/journal.pone.0077056

Leontyev, N. 2014. Natsionalnyj korpus Internet-sajtov gazet na yakutskom yazyke [National corpus of newspaper web sites in Yakut]. Zhurnal nauchnyx i prikladnykh issledovaniy "Infinity", 4:35-36.

Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2016. Ethnologue: Languages of the World, Nineteenth edition. SIL International, Dallas, Texas. Online version: http://www.ethnologue.com.

Moshagen, S. N., Pirinen, T. A., and Trosterud, T. 2013. Building an open-source development infrastructure for language technology projects. Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), 343-352.

Orekhov, B. and Reshetnikov, K. 2014. K otsenke vikipedii kak lingvisticheskogo istochnika [Assessing Wikipedia as a linguistic source]. Sovremennyj russkiy yazyk v internete, Moscow: 310-321.

Salchak, A. and Bayirool, A. 2013. Elektronnyj korpus tuvinskogo yazyka: sostoyanie, problemy [Electronic corpus of Tuva: current state, challenges]. Mir nauki, kultury, obrazovaniya, 6 (43).

Segalovich, I. 2003. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications. MLMTA'03. Las Vegas: 273-280.

Sheymovich, A. 2011. Morfologicheskaya razmetka korpusa khakasskogo yazyka [Morphological annotation of the Khakas corpus]. Rossiyskaya tyurkologiya, 2(5):48-61.

Suleymanov, D., Khakimov, B., and Gilmullin, R. 2011. Korpus tatarskogo yazyka: kontseptualnye i lingvisticheskie aspekty [Tatar corpus: conceptual and linguistic aspects]. Filologiya i kultura, 26.

Winkler, E. 2001. Udmurt. Lincom Europa, München.

Zaliznyak, A. 1977. Grammaticheskiy slovar russkogo yazyka: slovoizmenenie [Grammatical dictionary of the Russian language: inflection]. Russkiy yazyk, Moscow.

Zaydelman, L., Krylova, I., Orekhov, B., Popov, I., and Stepanova, E. 2016. Russian minority languages: descriptive statistics. Computational Linguistics and Intellectual Technologies: papers from the Annual conference “Dialogue”.

Zheltov, P. 2015. Sozdanie natsionalnogo korpusa chuvashskogo yazyka: problemy i perspektivy [Development of Chuvash National Corpus: Challenges and perspectives]. Sovremennye problemy nauki i obrazovaniya, 1.
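
As a supplement to the Udmurt dictionary entry shown in section 3.2, the following minimal sketch reads entries of that shape into Python dictionaries. It assumes only what the example itself shows: an entry starts with a "-lexeme" line and continues with "key: value" lines. The function name parse_lexemes is hypothetical, and the sketch is not the loader used by the UniParser software.

```python
def parse_lexemes(text):
    """Parse '-lexeme' blocks of 'key: value' lines into a list of dicts."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "-lexeme":
            # A new dictionary entry starts here.
            entries.append({})
        elif ":" in line and entries:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(":", 1)
            entries[-1][key.strip()] = value.strip()
    return entries


# The entry quoted in section 3.2.
SAMPLE = """\
-lexeme
 lex: кизьыны
 stem: киз.
 gramm: V,I
 paradigm: connect_verbs-I-soft
 trans_ru: сеять, посеять, засеять
"""

entry = parse_lexemes(SAMPLE)[0]
print(entry["lex"], entry["gramm"].split(","))
```

Keeping every field as a plain string leaves the interpretation of tags such as "V,I" to the caller, which seems the safer choice when the full format specification is not at hand.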




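
Three of the dictionary conversion steps described in section 3.2 can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in the projects: all function names are hypothetical, the threshold value max_len is invented, the threshold is assumed to apply only to the cut after a colon, and the stems "vera" and "veralʼlʼa" are assumed to be obtained from the infinitives veranɨ and veralʼlʼanɨ cited in the paper. Only the iterative marker attested in the example vera-lʼlʼa-z is listed.

```python
import re

# Infinitive endings named in section 3.2 (in the paper's transcription).
VERB_ENDINGS = ("ɨnɨ", "anɨ")
# The single false positive reported in the paper: ǯɨnɨ 'half'.
NOT_VERBS = {"ǯɨnɨ"}


def guess_pos(lemma):
    """Return 'V' for lemmata that look like infinitives, else None."""
    if lemma not in NOT_VERBS and lemma.endswith(VERB_ENDINGS):
        return "V"
    return None  # other parts of speech need a different source of evidence


def clean_translation(text, max_len=40):
    """Shorten a translation equivalent; max_len is an assumed threshold."""
    text = re.sub(r"\s*\([^()]*\)", "", text)  # drop parenthesized comments
    text = text.split(";", 1)[0]               # drop everything after ';'
    if len(text) > max_len:                    # cut after ':' only if long
        text = text.split(":", 1)[0]
    return re.sub(r"\s+", " ", text).strip(" ,")


def prune_derived_stems(stems, markers=("lʼlʼa",)):
    """Drop stems that equal another listed stem plus a derivational marker."""
    stems = set(stems)
    derived = {
        stem
        for stem in stems
        for marker in markers
        if stem.endswith(marker) and stem[: -len(marker)] in stems
    }
    return stems - derived


print(guess_pos("veranɨ"))
print(clean_translation("to sow (grain); to scatter"))
print(sorted(prune_derived_stems({"vera", "veralʼlʼa"})))
```

The sketch removes every derived stem it finds. The handful of derivatives with non-compositional meaning mentioned in section 3.2 would have to be kept by hand, for example through an exception list similar to NOT_VERBS.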