<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bilingual Parallel Corpora Featuring the Circum-Baltic Languages within the Russian National Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmitri Sitchinava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalia Perkova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>General Overview of the Project</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Higher School of Economics, 21/4 Staraya Basmannaya 105066 Moscow, Russia / Institute of the Russian language</institution>
          ,
          <addr-line>18/2 Volkhonka 119019 Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Stockholm University</institution>
          ,
          <addr-line>SE-106 91 Stockholm</addr-line>
          ,
          <country country="SE">Sweden /</country>
          <institution>Uppsala University</institution>
          ,
          <addr-line>Box 256, 751 05 Uppsala</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <fpage>495</fpage>
      <lpage>502</lpage>
      <abstract>
        <p>The paper presents parallel corpora within the Russian National Corpus (RNC) featuring Circum-Baltic/Russian language pairs and describes the choice of texts, morphological annotation and possible applications. The following languages of the Circum-Baltic linguistic area are included in the bilingual pairs of the corpus: Estonian, Finnish, Latvian, Lithuanian, Polish, and Swedish. The corpus includes both fiction and non-fiction texts and has a diachronic dimension. The morphological annotation of different languages is sensitive to language-specific categories and features. For each language an expanded RNC tagset is constructed which allows cross-linguistic comparison but at the same time takes into consideration differences in grammatical systems. The corpora can be used for exploring grammatical and lexical features of the Circum-Baltic region that have no straightforward correspondence in Russian and are often rendered by other means. Further expansion of the corpus with non-fiction genres is particularly important for the study of lexicon and syntax specific to legalese, media or academic style.</p>
      </abstract>
      <kwd-group>
        <kwd>parallel corpora</kwd>
        <kwd>Circum-Baltic area</kwd>
        <kwd>grammatical typology</kwd>
        <kwd>contrastive linguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 General Overview of the Project</title>
      <p>
        Parallel corpora are linguistic corpora in multiple languages consisting of original
and translated texts with corresponding alignment, most typically
sentence-by-sentence. The translations included in a parallel corpus are never made for this
purpose; they are already available to ordinary readers and are supposed to convey the
original meaning of the text as accurately as possible (however, some problems and
challenges are inevitable, in particular when fiction or religious texts are involved). It
is also generally assumed that the translation is a natural-sounding text in the target
language (which is not always the case, cf. the notion of translationese [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], that is,
a style more or less heavily influenced by the source language, which shows through in the translation).
      </p>
      <p>
        Parallel corpora, including so-called massive parallel corpora [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], have been
actively used as sources of typological and contrastive linguistic information for the
analysis of different lexical and grammatical phenomena. The semantics of lexical
and grammatical items can be analyzed via natural translations of their occurrences to
another language or a group of languages. The meaning that is conveyed in the
process of translation and is (ideally) shared by the original and its translations, can be
used as tertium comparationis for the comparison of linguistic phenomena, that is the
ways in which particular meanings, or contexts, are expressed (cf. the discussion of
comparable concepts in linguistics, see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Translations also have the property (for
some linguists, also the advantage) of not being specially elicited for linguistic
purposes, unlike the translational questionnaires widely used by typologists, which
collect data that can indirectly be treated as parallel texts, too (see the
seminal study [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] which discusses this type of data). Among the phenomena which
have been investigated in multilingual parallel corpora, one can mention motion verbs
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], aspect in Slavic imperative forms [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and the perfect gram in European languages
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Our purpose is to incorporate parallel bilingual corpora with Russian and the
Circum-Baltic languages ([
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) into the Russian National Corpus. These
languages, belonging to different language groups (Finnic, Baltic, Slavic, and Germanic),
share some common typological traits and exhibit mutual influences within smaller
areas. The parallel corpora with Russian can be seen as a tool for researching some of
these phenomena (and of course many other grammatical and lexical items), including
those for which Russian has no direct structural correspondence, rendering them by
other means (for example, the perfect tense or the marking of reported information,
so-called evidentiality). For both geographical and historical reasons, Russia being a
close neighbour of the area in question (and the Baltic States and Poland previously
being part of the Russian Empire and later of the USSR or the Soviet bloc, respectively), many
texts written in the Circum-Baltic languages are available in Russian translations and
vice versa, and new translations of modern texts in both directions keep appearing. This makes
it possible to rely on the existence of translations from and into Russian, which is
especially useful considering that Russian can be seen as a rather high-resource
language compared to many languages of the area.
      </p>
      <p>
        The Russian National Corpus (henceforth RNC, http://ruscorpora.ru) already has a
set of bilingual corpora and a multilingual parallel corpus [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], searchable online and
aligned sentence-by-sentence (represented in an XML format, with the
metainformation stored in a CSV table). Of the languages included in
bilingual parallel corpora with Russian, the following belong to the linguistic
area in question: Polish (the Polish-Russian corpus has been available since 2010, 6 million
tokens), Estonian (launched in 2015, 600 thousand tokens), Latvian (since 2016, see
Perkova, Sitchinava 2016 for more detail; 2.5 million tokens), Swedish (since 2017,
3.6 million tokens), Lithuanian (since 2018, 560 thousand tokens), and Finnish (yet to
be published online, 2 million tokens). The corpora are searchable online,
and contexts of up to seven sentences can be extracted and downloaded. The full texts
are not downloadable due to copyright restrictions.
      </p>
      <p>The architecture of the Latvian, Lithuanian and Swedish parallel corpora with
Russian is planned (and the texts themselves aligned) by one of the present authors,
Natalia Perkova. She also participates, together with Elizaveta Fomina, in the
Estonian subproject. The Polish-Russian Parallel Corpus was compiled by the RNC team
together with the Polish Academy of Sciences, and the Finnish-Russian corpus is
being prepared by Karina Mishchenkova in collaboration with the ParRus and ParFin
projects, headed by Mikhail Mikhailov at the University of Tampere.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Metadata markup and architecture</title>
      <p>All texts in the corpus receive metatextual markup that specifies their genre, authorship
(including information about translators), date (of both the source and the target
text), and the direction of translation. Using these parameters, users can build
custom subcorpora; for example, a linguistic phenomenon can be searched within different time
periods to reveal possible diachronic changes. A subcorpus of a given author or translator
can be built to track the author-specific patterns that are more prominent in the
so-called “translationese” style.</p>
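      <p>The subcorpus selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (text_id, direction, source_year and so on), the semicolon delimiter and all sample rows are hypothetical placeholders and do not reproduce the actual RNC metadata table.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io

# Hypothetical metadata table: column names and rows are placeholders,
# not the real RNC schema or real corpus texts.
SAMPLE_METADATA = """\
text_id;author;translator;genre;source_year;target_year;direction
t1;Author A;Translator X;fiction;1898;1955;ru-lv
t2;Author B;Translator Y;fiction;1976;1979;lv-ru
t3;Author C;Translator X;news;2015;2015;lv-ru
t4;Author D;Translator Z;fiction;2008;2012;ru-lv
"""


def load_metadata(text):
    """Read the CSV text into a list of dicts with integer year fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for row in reader:
        row["source_year"] = int(row["source_year"])
        row["target_year"] = int(row["target_year"])
        rows.append(row)
    return rows


def select_subcorpus(rows, direction=None, genre=None,
                     translator=None, source_years=None):
    """Return the ids of texts matching every criterion that is not None."""
    selected = []
    for row in rows:
        if direction is not None and row["direction"] != direction:
            continue
        if genre is not None and row["genre"] != genre:
            continue
        if translator is not None and row["translator"] != translator:
            continue
        if source_years is not None:
            first, last = source_years
            if not (first <= row["source_year"] <= last):
                continue
        selected.append(row["text_id"])
    return selected


rows = load_metadata(SAMPLE_METADATA)
print(select_subcorpus(rows, direction="lv-ru", genre="fiction"))
print(select_subcorpus(rows, translator="Translator X"))
print(select_subcorpus(rows, source_years=(2000, 2019)))
```
]]></preformat>
      <p>Combining a translator filter with a period filter in this way yields the kind of subcorpus in which translator-specific or period-specific patterns could then be searched.</p>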
      <p>All the bilingual parallel corpora are planned to include both fiction and
non-fiction texts (all the texts are included in full). However, the primary focus has been
on fiction as the most obvious choice and the source of rather varied texts potentially
representative of a wide range of stylistic phenomena. Ideally, the corpus should be
representative, comprising dialogical, narrative, scientific, official and other types of
texts with their specific characteristics reflected in linguistic structures and lexicon.</p>
      <p>The bilingual corpora also feature a diachronic dimension, including texts (and
translations, though the latter naturally tend to be more recent) from different
historical periods from the 19th to the 21st century. In a multilingual (“massive parallel”)
corpus only internationally renowned texts (e.g. Russian or Swedish 19th-century
classics) that have been translated into many languages can be included; they cannot be very
numerous, and in any case few recent original texts are available. Where, as
with the Circum-Baltic/Russian language pairs, many texts have
been translated in both directions, bilingual corpora can feature a more representative
collection of culturally significant fiction of different periods, both the “cultural
canon” and contemporary texts of the 2000s and 2010s (the latter include works by
authors like Danny Wattin or Carl-Johan Vallgren in Swedish, Ljudmila
Petrushevskaja or Marina Stepnova in Russian etc.). The Swedish texts included in
the corpus also feature the “Finnish Swedish” variant (for example, the works by
Tove Jansson). This diachronic representativeness is the major innovative feature of
the corpus as compared to the existing parallel corpora featuring Circum-Baltic
languages.</p>
      <p>Some fiction texts are included in more than one translated version, thus allowing
for polyvariant texts. For example, in the Latvian-Russian corpus this is the case of
some of Chekhov’s stories represented in the translations by Anna Grēviņa, Oskars
Kalnciems, Paulis Kalva and Regīna Ezera, or “Four Rides” by Vilis Lācis translated
into Russian by G. Ceitlin and V. Rugais. Alternative translations are provided for the
Swedish-language children’s books by Tove Jansson and Astrid Lindgren. These texts
can be used to explore variation in translation and to study the contextual synonymy
of different grammatical or lexical items.</p>
      <p>The main source of non-fiction translated into Russian is currently the site inosmi.ru,
which features newspaper articles from the foreign press translated into Russian from many
languages, including all the Circum-Baltic languages involved. In addition,
legal texts and international treaties are currently included in the
Finnish-Russian part.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Morphological annotation</title>
      <p>All the texts within the bilingual parallel corpora are morphologically annotated;
more precisely, the POS and grammatical features of wordforms are specified. Thus,
any combination of grammatical features, words and/or their parts is searchable
within the corpus.</p>
      <p>
        For Russian the Mystem analyzer developed by the Yandex company is used [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ];
the Polish morphological analysis is based on the TaKIPI algorithm that predicts tags
statistically [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The Latvian morphological analyzer was implemented in 2016 on
the basis of the LUMII morphological tagger [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; this tagger, not unlike the Polish one,
does not specify alternative morphological analyses and chooses only the one that is
most probable statistically. The Lithuanian analyzer used in the corpus is based on the
VDU (Kaunas) morphological annotator [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; for Estonian we use the Corpus
analyzer developed at Tartu University [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; for Swedish the open-source Stagger analyzer
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>We owe the technical implementation of these taggers and their harmonization
with the XML format used in the RNC parallel corpora to Danko Aleksejevs (Latvian
and Lithuanian), Timofey Arkangelsky (Estonian and Swedish) and Boris Orekhov
(Polish).</p>
      <p>The morphological annotation of different languages is sensitive to
language-specific categories and features. The originally obtained morphological information is
preserved as much as possible and is not affected by the conversion to RNC tags. In effect,
an expanded RNC tagset is constructed for each language, which allows
cross-linguistic comparison but at the same time takes into consideration differences
in grammatical systems. For example, in Estonian texts, unlike elsewhere within the
RNC, compound nouns, which are productive in this language, are tagged as compounds, and
their stems are separated by a plus sign. For Polish, the cliticized auxiliary ‘be’
in the conditional (pojechał-by-m go-BE.COND-1sg ‘I would go’) is marked as a
separate word form (with an additional tag ‘clitic’), orthographically attached to the main
verb, and at the same time as a particle (the feature nwok marks a non-vocalized
variant of the clitic as opposed to -em).</p>
      <p>An example of aligned translations of a sentence with Russian, Latvian, and
Lithuanian markup (The Man in a Case by Chekhov, translated by Paulis Kalva
and E. Viskanta, respectively; the phrase means ‘It was already midnight’):</p>
      <p>&lt;se lang="ru"&gt;&lt;w&gt;&lt;ana lex="быть" sem="t:be:exist ca:noncaus d:root" disamb="yes" gr="V,act,f,indic,intr,ipf,norm,praet,sg" sem2="ca:noncaus d:root"/&gt;Была&lt;/w&gt;
&lt;w&gt;&lt;ana lex="уже" sem="t:time" disamb="yes" gr="ADV,norm" sem2=""/&gt;уже&lt;/w&gt;
&lt;w&gt;&lt;ana lex="полночь" sem="ev:posit r:abstr t:time" disamb="yes" gr="S,acc,f,inan,norm,sg" sem2="t:space r:concr r:abstr"/&gt;&lt;ana lex="полночь" sem="ev:posit r:abstr t:time" disamb="yes" gr="S,f,inan,nom,norm,sg" sem2="t:space r:concr r:abstr"/&gt;полночь&lt;/w&gt;.&lt;/se&gt;</p>
      <p>&lt;se lang="lv"&gt;&lt;w&gt;&lt;ana lex="būt" gr="V=indic,praet,act,3p"/&gt;Bija&lt;/w&gt;
&lt;w&gt;&lt;ana lex="jau" gr="ADV,time="/&gt;jau&lt;/w&gt;
&lt;w&gt;&lt;ana lex="pusnakts" gr="S,common,f=sg,nom"/&gt;pusnakts&lt;/w&gt; .&lt;/se&gt;</p>
      <p>&lt;se lang="lt" variant_id="1"&gt;&lt;w&gt;&lt;ana lex="būti" gr="V,nrefl=indic,praet,sg,3p"/&gt;&lt;ana lex="būti" gr="V,nrefl=indic,praet,pl,3p"/&gt;Buvo&lt;/w&gt;
&lt;w&gt;&lt;ana lex="jau" gr="ADV=pos"/&gt;&lt;ana lex="jau" gr="PART="/&gt;jau&lt;/w&gt;
&lt;w&gt;&lt;ana lex="vidurnaktis" gr="S=m,sg,nom"/&gt;vidurnaktis&lt;/w&gt;.&lt;/se&gt;</p>
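      <p>Markup of this kind can be read with a standard XML parser. The following minimal Python sketch parses the Lithuanian sentence quoted above and lists the tokens that carry more than one analysis. The helper names (split_gr, read_sentence) are hypothetical, and the treatment of the "=" sign in the gr attribute as a separator between lemma-level and wordform-level tags is an assumption inferred from the examples, not a documented rule.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# The Lithuanian sentence from the example, with entities unescaped.
SENTENCE = (
    '<se lang="lt" variant_id="1">'
    '<w><ana lex="būti" gr="V,nrefl=indic,praet,sg,3p"/>'
    '<ana lex="būti" gr="V,nrefl=indic,praet,pl,3p"/>Buvo</w> '
    '<w><ana lex="jau" gr="ADV=pos"/><ana lex="jau" gr="PART="/>jau</w> '
    '<w><ana lex="vidurnaktis" gr="S=m,sg,nom"/>vidurnaktis</w>.'
    '</se>'
)


def split_gr(gr):
    """Flatten a gr value into a list of tags.

    Assumption: "=" separates lemma-level tags from wordform-level tags;
    values without "=" (as in the Russian example) are a plain list.
    """
    lexical, _, inflectional = gr.partition("=")
    return [tag for tag in (lexical + "," + inflectional).split(",") if tag]


def read_sentence(xml_text):
    """Return the language code and the tokens with all their analyses."""
    se = ET.fromstring(xml_text)
    tokens = []
    for w in se.iter("w"):
        analyses = [(ana.get("lex"), split_gr(ana.get("gr", "")))
                    for ana in w.findall("ana")]
        # The surface form is the text inside <w>, after the <ana/> children.
        form = "".join(w.itertext()).strip()
        tokens.append({"form": form, "analyses": analyses})
    return se.get("lang"), tokens


lang, tokens = read_sentence(SENTENCE)
forms = [t["form"] for t in tokens]
ambiguous = [t["form"] for t in tokens if len(t["analyses"]) > 1]
print(lang, forms)
print("ambiguous:", ambiguous)
```
]]></preformat>
      <p>In this sentence, Buvo and jau each carry two analyses, whereas the Latvian sentence has exactly one analysis per token, in line with the single-best output of the Latvian tagger described above.</p>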
    </sec>
    <sec id="sec-4">
      <title>4 Directions of corpus-based research</title>
      <p>
        The corpora can be used for exploring grammatical and lexical features of
the Circum-Baltic region that have no straightforward correspondence in Russian
and are often rendered by other means. The Circum-Baltic linguistic area is
characterized by the perfect aspectual gram and, in addition, the possessive perfect
construction (see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on the Latvian and Lithuanian forms in a parallel corpus); even
Polish has a new possessive perfect construction. This opposition has been lost in Modern
Russian but, interestingly, has reappeared in the Russian and Belarusian North-Western
dialects under the influence of Finnic and Baltic languages (see [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] for more detail).
Perfect-based tenses such as the pluperfect or the future perfect tend to acquire secondary
meanings. In particular, the future perfect forms in the Baltic languages have the
secondary semantics of hypothetical events (or of inference with past time reference, not
unlike their use in other languages of Europe, see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] on the typological context). Overall,
only 25% of the Lithuanian future perfect forms in the corpus have future time
reference, whereas about half of the examples signal hypotheses and/or inferences
concerning past events, and the remaining 25% fall into the category of precedence with
regard to an irrealis or habitual situation. For Latvian, the corresponding figures are
even more striking in terms of semantic non-compositionality: 27%, 66% and
2%, respectively (with some other marginal or ambiguous uses).
      </p>
      <p>In Russian, a wide range of discourse markers corresponds to the hypothetical and
inferential uses of the future perfect in Baltic, e.g., navernoe ‘perhaps’, dumaju ‘I
think’, konečno ‘certainly’, or even ordinary past tense forms without additional
markers:</p>
      <p>Latv. Zini, vecomāt, es laikam arī būšu iemīlējusies [fall.in.love.FUT.PERF].
[Zenta Ērgle. Starp mums, meitenēm, runājot... (1976)]</p>
      <p>Rus. Znaeš, babuška, ja navernoe [probably] vljubilas’ [fall.in.love.PST]. [(Ž. Ezit
trans. 1979)]
‘You know, Granny, I have [probably] fallen in love’</p>
      <p>Rus. Ètogo nikogda ne bylo… serdce šalit… ja pereutomilsja [overworked.PST]
[M. Bulgakov. Master and Margarita, 1925-1940].</p>
      <p>Latv. Tas nu nekad nav bijis … sirds streiko … būšu pārpūlējies
[overworked.FUT.PERF]. [Ojārs Vācietis trans.]</p>
      <p>‘This has never happened before. My heart's acting up… I'm [evidently]
overworked…’</p>
      <p>Lit. Čia, tose plynėse, tuose miškuose, ant šitų kelių ir takų viskas bus prasidėję
[start.FUT.PERF], ėmę gauti prasmę… [Juozas Aputis. Lidija Skoblikova ir tėvo
žingsniai (1980-1989)]</p>
      <p>Rus. Imenno zdes’, v ètix pustošax, v ètix lesax, na ètix dorogax i tropax, vsë,
požaluj [perhaps], i načalos’ [start.PST], stalo obretat’ smysl… [Virgilijus Čepaitis
trans., 1989]</p>
      <p>‘It was perhaps here, in these wastelands and woods, on these roads and paths,
where all these things emerged and started to make sense’</p>
      <p>Lexical correspondences can also be investigated on the basis of data from parallel
texts. For instance, the Russian word toska ‘~yearning, anguish, misery’ is rendered in
most languages by a highly diverse set of translations, that is, with high statistical entropy.
In particular, the Estonian-Russian parallel corpus (currently relatively
small) already contains seven Estonian correspondences for toska, namely koduigatsus
‘~nostalgia/homesickness’, ahastama ‘~anguish, depression’, masendus
‘~depression’, mure ‘~anxiety’, nukrus ‘~sadness, grief’, tusk ‘~chagrin’
(interestingly, an old borrowing from Slavic and related to toska), igatsus ‘yearning, longing’.</p>
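      <p>The notion of entropy invoked here can be made concrete with Shannon entropy computed over the translation equivalents attested for a source word. The sketch below is illustrative only: the function name translation_entropy is hypothetical, and the two input lists are invented distributions over the seven Estonian equivalents, since the paper does not report the frequency of each equivalent.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def translation_entropy(equivalents):
    """Shannon entropy (in bits) of a list of attested translation equivalents."""
    counts = Counter(equivalents)
    total = sum(counts.values())
    probabilities = [n / total for n in counts.values()]
    return sum(p * math.log2(1 / p) for p in probabilities)


SEVEN = ["koduigatsus", "ahastama", "masendus", "mure",
         "nukrus", "tusk", "igatsus"]

# Invented distribution 1: every equivalent attested equally often.
uniform = list(SEVEN)

# Invented distribution 2: one dominant equivalent, the others rare.
skewed = ["igatsus"] * 14 + SEVEN[:6]

print(round(translation_entropy(uniform), 3))
print(round(translation_entropy(skewed), 3))
print(translation_entropy(["igatsus"] * 5))
```
]]></preformat>
      <p>With seven equally frequent equivalents the entropy reaches its maximum of log2(7), about 2.81 bits; a distribution dominated by one equivalent scores lower, and a word always translated in the same way scores zero.</p>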
      <p>Future development of the Circum-Baltic parallel corpora with Russian includes
expanding the corpora to cover all periods of fiction from the 19th century to
the present day. Further expansion of the corpus with non-fiction genres is particularly
important for the study of lexicon and syntax specific to legalese, media or academic
style.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The research is supported by the Russian Foundation for Basic Research, project
1734-01061-OGN “Slavic future anterior in a typological perspective”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arkadiev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daugavet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The perfect in Lithuanian and Latvian: a contrastive investigation</article-title>
          .
          <source>Talk at Academia Grammaticorum Salensis Tertia Decima, 1-6 August</source>
          <year>2016</year>
          (http://inslav.ru/sites/default/files/arkadievdaugavet2016_baltperf_salos.pdf)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cysouw</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wälchli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.):
          <article-title>Parallel Texts. Using Translational Equivalents in Linguistic Typology</article-title>
          .
          <source>Theme issue in: Sprachtypologie &amp; Universalienforschung STUF</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ) (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          :
          <source>Tense and Aspect Systems</source>
          . Blackwell, Oxford (
          <year>1985</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          :
          <article-title>The perfect map: Investigating the cross-linguistic distribution of TAME categories in a parallel corpus</article-title>
          . In:
          <string-name>
            <surname>Szmrecsanyi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wälchli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.):
          <source>Aggregating Dialectology, Typology, and Register Contents Analysis. Linguistic Variation in Text and Speech. Linguae &amp; litterae 28</source>
          , pp.
          <fpage>268</fpage>
          -
          <lpage>289</lpage>
          . Walter de Gruyter, Berlin (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koptjevskaja-Tamm</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.):
          <source>Circum-Baltic languages. Typology and contact</source>
          . Vol.
          <volume>1-2</volume>
          , Benjamins, Amsterdam-Philadelphia (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gellerstam</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Translationese in Swedish novels translated from English</article-title>
          . In:
          <source>Translation studies in Scandinavia</source>
          , pp.
          <fpage>88</fpage>
          -
          <lpage>95</lpage>
          . CWK Gleerup, Malmö
          (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Haspelmath</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Comparative concepts and descriptive categories in crosslinguistic studies</article-title>
          .
          <source>Language</source>
          <volume>86</volume>
          (
          <issue>3</issue>
          ),
          <fpage>663</fpage>
          -
          <lpage>687</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kaalep</surname>
            ,
            <given-names>H.-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muischnek</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müürisep</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rääbis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Habicht</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Kas tegelik tekst allub eesti keele morfoloogilistele kirjeldustele? Eesti kirjakeele testkorpuse morfosüntaktilise märgendamise kogemusest</article-title>
          .
          <source>Keel ja Kirjandus</source>
          <volume>9</volume>
          ,
          <fpage>623</fpage>
          -
          <lpage>633</lpage>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Östling</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Stagger: an Open-Source Part of Speech Tagger for Swedish</article-title>
          .
          <source>Northern European Journal of Language Technology</source>
          <volume>3</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pen'kova</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Ot retrospektivnosti k prospektivnosti: grammatikalizacija predbuduschego v jazykax Evropy [Russian: From Retrospectiveness to Prospectiveness: Grammaticalization of Antefuturum in the Languages of Europe]</article-title>
          .
          <source>Voprosy jazykoznanija 2</source>
          ,
          <fpage>53</fpage>
          ‒
          <lpage>70</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Perkova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sitchinava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On the Development of a Latvian-Russian Parallel Corpus</article-title>
          . In:
          <string-name>
            <surname>Skadiņa</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rozis</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (eds.):
          <source>Human Language Technologies - The Baltic Perspective: Proceedings of the Seventh International Conference Baltic HLT 2016</source>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>135</lpage>
          . IOS Press, Amsterdam (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Piasecki</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radziszewski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Godlewski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          et al.:
          <source>TaKIPI</source>
          , CLARIN-PL digital repository (
          <year>2014</year>
          ), http://hdl.handle.net/11321/31, last accessed 2019/02/11.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rimkutė</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daudaravičius</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Utka</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Morphological Annotation of the Lithuanian Corpus</article-title>
          . In:
          <source>45th Annual Meeting of the Association for Computational Linguistics, Workshop Balto-Slavonic Natural Language Processing</source>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>99</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Seržant</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiemer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.):
          <article-title>Contemporary approaches to dialectology: the area of North, North-West Russian and Belarusian dialects</article-title>
          . University of Bergen, Bergen (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Segalovich</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine</article-title>
          . In:
          <source>Machine Learning; Models, Technologies and Applications</source>
          , Las Vegas
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sitchinava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Parallel corpora within the Russian National Corpus</article-title>
          .
          <source>Prace Filologiczne</source>
          <volume>63</volume>
          ,
          <fpage>271</fpage>
          -
          <lpage>278</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wälchli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cysouw</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Lexical typology through similarity semantics: Toward a semantic map of motion verbs</article-title>
          .
          <source>Linguistics</source>
          <volume>50</volume>
          (
          <issue>3</issue>
          ),
          <fpage>671</fpage>
          -
          <lpage>710</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Waldenfels</surname>
            ,
            <given-names>R. von</given-names>
          </string-name>
          :
          <article-title>Explorations into variation across Slavic: Taking a bottom-up approach</article-title>
          . In:
          <string-name>
            <surname>Szmrecsanyi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wälchli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.):
          <source>Aggregating Dialectology, Typology, and Register Contents Analysis. Linguistic Variation in Text and Speech. Linguae &amp; litterae 28</source>
          , pp.
          <fpage>290</fpage>
          -
          <lpage>323</lpage>
          . Walter de Gruyter, Berlin (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>