<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Aligned Kazakh-Russian Parallel Corpus Focused on the Criminal Theme</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Al-Farabi Kazakh National University</institution>
          ,
          <addr-line>71 al-Farabi Ave., Almaty</addr-line>
          ,
          <country>Republic of Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information and Computational Technologies</institution>
          ,
          <addr-line>125, Pushkin str., 050010, Almaty</addr-line>
          ,
          <country>Republic of Kazakhstan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>2, Kyrpychova str., 61002, Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1949</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The development of high-quality aligned parallel text corpora is one of the most relevant and advanced directions of modern linguistics. Special emphasis is placed on creating parallel multilingual corpora for low-resource languages such as Kazakh. In this study, we explored texts from four Kazakh bilingual news websites and used them to create a parallel Kazakh-Russian corpus of texts on the criminal subject. To align the corpus, we used a set of lexical compliances and the POS tags of both languages. According to expert assessment, about 60% of the corpus sentences are aligned correctly by the automatic procedure. Finally, we analyzed the factors affecting the percentage of errors.</p>
      </abstract>
      <kwd-group>
        <kwd>criminal subject</kwd>
        <kwd>news websites</kwd>
        <kwd>POS-tagging</kwd>
        <kwd>Kazakh-Russian parallel corpus</kwd>
        <kwd>alignment</kwd>
        <kwd>lexical compliances</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Today, linguistic resources are not only a part of any linguistic study but also an
important basis for designing NLP applications. Such resources typically include
dictionaries, thesauri, linguistic ontologies, and monolingual and multilingual corpora.
Creating these resources requires lexicographic research, analysis of the
lexical structure of languages, exploration of text characteristics, and similar
studies.</p>
      <p>
        The design, creation, development and use of high-quality text corpora is one of
the most relevant and advanced directions of modern linguistics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A corpus that has been processed
and systematized by means of a concordancer can store the large amount of
text information needed for statistical analysis of linguistic phenomena and of
diachronic change in spoken and written languages.
      </p>
      <p>There are many types of corpora: specialized corpora (by genre, time or
place), general corpora, learner corpora, historical or diachronic
corpora, monitor corpora and multilingual corpora. Multilingual corpora, in turn, are
divided into comparable corpora and parallel corpora, also called
translation corpora.</p>
      <p>In our opinion, parallel text corpora are particularly important for studying languages
and features of translation, and for tasks such as parsing and speech recognition. For
instance, in foreign language teaching, such corpora make it possible to find
equivalents of the analyzed vocabulary and to track its meanings and functions in
context.</p>
      <p>Furthermore, the parallel corpus is an integral part of a broader
and more difficult problem: machine translation. Machine
translation remains an unresolved task of computational linguistics, despite the rapid
growth of software and empirical resources. In some cases, the quality of
machine translation also depends on the number of parallel sentences used in training.</p>
      <p>
        Over the last decade, a number of bilingual and
multilingual corpora have been created worldwide. In our view, the most notable are:
EUROPARL (20,000,000 word usages, the open corpus of the European Parliament in 11 languages)
(https://www.isi.edu/~koehn/publications/europarl/); CHEMNITZ
GERMAN-ENGLISH TRANSLATION CORPUS (1,000,000 word usages)
(http://www.tuchemnitz.de/phil/InternetGrammar); KACENKA (Korpus anglicko-cesky; Czech,
3,000,000 word usages) (http://www.phil.muni.cz/angl/kacenka/kachna.html); OPUS
(5 languages) (https://aclanthology.info/papers/L04-1174/l04-1174); and the English-French
Canadian Hansard [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The largest and most prominent Kazakh language corpora are: the Almaty Corpus
of Kazakh (http://web-corpora.net/KazakhCorpus/search/), which contains more than 40
million word usages, 86% of which have grammatical analysis; the Kazakh text
corpora on Sketch Engine [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; the Open-Source-Kazakh-Corpus, created with
the Wikipedia dump tool and comprising a collection of 20 million words (600
thousand of them unique) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; and the Kazakh Language Corpus (KLC) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>At the same time, despite the existence of a large number of parallel multilingual
corpora, the task of creating parallel corpora remains vital for low-resource languages
such as Kazakh. The task becomes more complex when the
parallel corpus covers dissimilar languages from
different families, for instance when one language belongs to the Turkic family and the
other to the Indo-European family, as is the case for Kazakh and Russian.</p>
      <p>In our study, we explored texts in two languages (Russian and Kazakh) from
Kazakh bilingual news websites and created a parallel Kazakh-Russian corpus on the
basis of these texts. The texts of our parallel corpus do not belong to
fiction or another broad theme; they focus on the criminal subject, which makes them
domain-restricted. Therefore, we were able to apply the dictionary method to align the corpus.
In addition, to improve the quality of sentence alignment, we POS-tagged
the texts in both languages and exploited the resulting labels. Finally, we calculated the
percentage of correctly aligned sentences and analyzed the factors affecting the
percentage of errors.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Parallel corpora contain an original text and its translation into another
language. These two texts are not simply placed side by side; they have to
be aligned: particular fragments of the original text have to be matched with the
corresponding fragments of the translation. A parallel corpus is useful only
when it is aligned.</p>
      <p>Most studies explicitly or implicitly distinguish two levels of alignment:
sentence alignment and lexical alignment. In general, automatically matching
sentences or words in one text with their equivalents in the translation is very
labor-intensive, because the correspondence between words or sentences is not always
one-to-one. For instance, several paragraphs in the source language can correspond to one
paragraph in the target language, and in translation some words can be deleted or replaced
with very distant synonyms or with fixed phrases that differ completely between
languages.</p>
      <p>
        Sentence alignment methods can be classified into three categories. Methods of
the first category are based on the lengths of sentences and paragraphs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This
approach relies on the hypothesis that the lengths of a sentence in the original and in
the translation approximately match.
      </p>
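      <p>The length hypothesis can be illustrated with a minimal sketch. It is not the implementation of any cited system: the function names are hypothetical, and the parameters c (expected target characters per source character) and s2 (variance of that ratio) are assumed defaults that would have to be estimated for a concrete language pair.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def length_delta(src_len, tgt_len, c=1.0, s2=6.8):
    """Normalized gap between the observed and the expected target length.

    c and s2 are assumed, tunable parameters (not values from this study).
    A value near 0 means the two lengths agree with the hypothesis.
    """
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / c) / 2.0
    return (tgt_len - src_len * c) / math.sqrt(mean * s2)


def best_length_match(src_sentence, candidates):
    """Pick the candidate whose character length fits the source best."""
    return min(
        candidates,
        key=lambda t: abs(length_delta(len(src_sentence), len(t))),
    )
```
]]></preformat>
      <p>Such a score uses no linguistic knowledge at all, which makes it fast but also sensitive to systematic length differences between the two languages.</p>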
      <p>
        The second group of methods uses lexical information, which can be obtained from
the corpus [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These methods are applied rarely, which
is due to the limited availability of bilingual dictionaries and to the difficulty of using automatic
morphological analysis to match word forms with dictionary entries. To date, most
applications based on this group of methods use only texts on specialized subjects,
for example parliamentary and legal texts. Dictionary methods are rarely used for
literary texts because, even within one genre, the vocabulary of the compared sentences is
highly ambiguous.
      </p>
      <p>
        The third group of text alignment algorithms for parallel corpora is based on
POS tags contained in an annotated corpus or on spelling similarity [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        However, each method in these groups has a number of inaccuracies and
weaknesses. Accordingly, interest in systems that
combine these approaches is growing steadily. For instance, Varga et al.
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] described a hybrid method of parallel text alignment. Their alignment
technique combines the length-based method with a form of translation-based
similarity. Their research covered the Hungarian, Romanian, and Slovenian
languages.
      </p>
      <p>
        Rico Sennrich and Martin Volk [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] showed that sentence alignment
can be achieved without language-specific resources other than the
to-be-aligned parallel text. They used a length-based sentence alignment algorithm and trained
an SMT system on the to-be-aligned text. This system is used to translate the source
side of the parallel training corpus, and the sentence alignment is then based on this
translation. They reported that the iterative sentence alignment approach
gives the best results after just two iterations.
      </p>
      <p>
        Another approach to sentence alignment is described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
authors proposed a fast and robust sentence alignment algorithm,
FastChampollion, which combines length-based and lexicon-based
algorithms. The method is called “fast” because it optimizes the process of splitting
the input bilingual texts into small fragments for alignment. The method needs a
dictionary for aligning sentences, and its precision and recall drop as the size of
the dictionary decreases.
      </p>
      <p>
        Vondricka [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] reviewed InterText, an application for aligning
parallel corpora that is based on hybrid alignment methods. It exists
in two forms: InterText Server (a server-based text management system with a
web-based editor interface) and InterText editor (a personal desktop application). Both
are open-source software. The same application was used to create
Kazakh-English text corpora in the study by Zhumanov et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Furthermore, the authors
used the Bitextor and hunalign tools to crawl websites that contain the same
texts in several languages and to align them. The article by Rakhimova et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
addresses similar challenges. They considered the principles of using
Bitextor, which generates translation memories using multilingual
websites as a corpus source. It downloads an entire website and applies a set of
heuristics (based mainly on HTML tag structure and text block length) to find bitexts.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] aligned their dataset at the sentence level, that is, they
used strong punctuation for text segmentation. Such an approach requires
verification of text correctness and depends on the proximity of the languages, so manual control is
required. As a result, all medical and all literary texts in the Polish/Ukrainian pair have
been aligned and verified, while part of the French and English texts is still being
processed. At the same time, the Finnish and Russian versions of the Aranea corpora,
the newspaper subcorpus of the Russian national corpus and a corpus of the Finnish
national library are aligned sufficiently well [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        The creation of aligned parallel corpora faces the additional challenge of finding
textual resources for a parallel corpus. There is now a large body of research
on obtaining parallel sentences from non-parallel or comparable data. For
example, such linguistic data can be obtained from Wikipedia. It is an extremely
valuable resource for extracting parallel sentences, as the document alignment is already
provided and Wikipedia articles on the same topic can exist in different languages. In
addition, the articles are connected via “interwiki” links, which are annotated by users [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
However, Wikipedia has not been thoroughly explored yet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Consequently, we conclude that the problem of parallel corpus
alignment has not yet been fully solved and that no universal method has
been found. Furthermore, most studies to date consider that the choice of
alignment method depends on the language pair, the thematic focus of the texts and
the types of documents represented in the corpus [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset preparation</title>
      <p>Creating a parallel corpus involves several tasks. The first and one of the most
important is to collect text material for the corpus. Although the
Internet contains a large number of bilingual and multilingual websites, the
choice of suitable bilingual resources is an important part of parallel
corpus construction. This task is complicated by the fact that we process
languages as different as Kazakh and Russian. Additionally, specialized tools and
techniques are needed not only for language processing but also for
collecting the material for the corpus.</p>
      <p>We consider parsing of websites to be the best way to automate the process of
collecting and saving information. For our study, we developed our own
software for automatic website parsing, which can parse websites that are
similar in design, content and structure.</p>
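      <p>The parser developed for the study is not described in detail, so the following is only a minimal sketch of the general idea, assuming that the article text of a page sits in ordinary HTML paragraph elements. The class and function names are hypothetical, and only the Python standard library is used.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Text of nested inline tags (bold, links) is kept as well.
        if self._in_p:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return the non-empty paragraph texts of an HTML document."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```
]]></preformat>
      <p>Because the selected news sites are similar in structure, one extractor of this kind could plausibly be reused across them with site-specific adjustments.</p>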
      <p>Four bilingual websites (zakon.kz, caravan.kz, lenta.kz and nur.kz) were chosen for the
developed parser. These are well-known and reliable news
websites of Kazakhstan and the main news sources on the criminal subject. They
contain a large number of articles with criminal information, for example on
offences such as robberies, car thefts, murders and car incidents,
which is the material targeted by our study. Additionally, the websites allow switching the text
between two languages: Russian and Kazakh.</p>
      <p>Running the program produced a general set of 3000 texts in
two languages: Russian and Kazakh. From these, we manually selected the test
set for creating the aligned parallel corpus of Russian-Kazakh texts on
criminal information. The corpus size is more than 50410 words, about 24800 of them in
Russian and about 25600 in Kazakh.</p>
      <p>At the next step, we determined the structure of the corpus.
Such a structure can take various forms, depending on the practical purposes of its
creators or users:
        <list list-type="bullet">
          <list-item><p>a traditional text with references to the translations;</p></list-item>
          <list-item><p>a tabular "mirror" form, which is more convenient for perception and comparison;</p></list-item>
          <list-item><p>a database.</p></list-item>
        </list>
      </p>
      <p>For our study, we chose the database structure, as it is the most convenient way to
store a large amount of data with the possibility of further growth.</p>
      <p>All news articles are stored in a database table that includes their ID,
title, the address of the website and the text of the article.</p>
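      <p>A table of this kind could be sketched as follows. The database engine (SQLite), the table and column names, and the extra lang column are assumptions made for illustration; the text specifies only the four fields ID, title, website address and article text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import sqlite3


def create_corpus_db(path=":memory:"):
    """Open a database and create the (hypothetical) articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               id    INTEGER PRIMARY KEY,  -- article ID
               title TEXT NOT NULL,        -- headline
               url   TEXT NOT NULL,        -- address of the website page
               lang  TEXT NOT NULL,        -- assumed extra field: 'kk' or 'ru'
               body  TEXT NOT NULL         -- text of the article
           )"""
    )
    return conn


def add_article(conn, title, url, lang, body):
    """Insert one article and return its generated ID."""
    cur = conn.execute(
        "INSERT INTO articles (title, url, lang, body) VALUES (?, ?, ?, ?)",
        (title, url, lang, body),
    )
    conn.commit()
    return cur.lastrowid
```
]]></preformat>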
      <p>At the following stage, we carried out POS-tagging of the corpus. To label the Russian
part of the corpus, we chose pymorphy2 (https://nlpub.ru/Pymorphy), a Python package
developed specifically for morphological analysis of Russian and Ukrainian
texts. The package uses the OpenCorpora
(https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/corpora/dictionary/index.html)
dictionary and makes hypothetical conclusions for unrecognized words.</p>
      <p>In turn, the complexity and the structural and typological characteristics of Kazakh tagging
are connected with the fact that Kazakh is an agglutinative language. The structure of this
language is rather difficult and unusual for speakers whose native language is inflectional.
Agglutination is the opposite of inflection, in which every formant has several
inseparable meanings at once (for example, case, gender and number). For this
reason, we POS-tag Kazakh texts with a regular expression tagger based on the
RegexpTagger class of the nltk Python package (https://www.nltk.org/). For example, we
can identify some types of nouns in Kazakh texts via the following list of regular
expressions:
        <preformat>patterns = [(r'.*бен$', 'NN'), (r'.*пенен$', 'NN'), (r'.*басшылық$', 'NN'),
            (r'.*іпқону$', 'NN'), (r'.*тармен$', 'NN'), (r'.*герлермен$', 'NN'),
            (r'.*здар$', 'NN')]</preformat>
Additionally, to increase the recall and precision of POS-tagging of Kazakh texts, we
combine the regular expressions with a system of seven rules. For instance,
"If a word is followed by a word from the special list, the word is marked as Verb".</p>
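      <p>How suffix patterns and a contextual rule can work together is sketched below using only the standard re module rather than nltk, so that the example stays self-contained. The contents of the "special list" are not given in the text: the two auxiliary words used here, the tag name VB and the function name are placeholders for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Noun suffix patterns taken from the list of regular expressions in the text.
NOUN_PATTERNS = [r".*бен$", r".*пенен$", r".*тармен$", r".*герлермен$", r".*здар$"]
NOUN_RES = [re.compile(p) for p in NOUN_PATTERNS]

# Hypothetical stand-in for the "special list" of the rule system.
VERB_MARKERS = {"еді", "екен"}


def tag_tokens(tokens):
    """Tag tokens with 'NN', 'VB' or None (unknown)."""
    tags = []
    for token in tokens:
        word = token.lower()
        is_noun = any(rx.match(word) for rx in NOUN_RES)
        tags.append("NN" if is_noun else None)
    # Contextual rule: a word followed by a marker word is tagged as a verb.
    for i in range(len(tokens) - 1):
        if tokens[i + 1].lower() in VERB_MARKERS:
            tags[i] = "VB"
    return list(zip(tokens, tags))
```
]]></preformat>
      <p>In this sketch the contextual rule overrides the suffix patterns; the order in which the actual seven rules are applied is not stated, so this precedence is also an assumption.</p>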
    </sec>
    <sec id="sec-4">
      <title>The automatic alignment of the corpus</title>
      <p>At the first step of the automatic alignment of our corpus, we segmented the texts using
punctuation symbols, capital letters and paragraph boundaries. At the next step, two
basic approaches to sentence alignment are possible. The first approach, which is
significantly faster, is based on sentence length. In the second, more
resource-intensive approach, a set of lexical compliances is established by a word alignment method. In our
research, the first approach would not yield exact and objective results because the Kazakh
language is agglutinative: word forms are built by adding
affixes, as well as auxiliary words, that carry semantic and morphological
information. For this reason, alignment by the length of sentences or
paragraphs is not effective for a pair of an inflectional and an agglutinative language.</p>
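      <p>The segmentation step can be illustrated with a short sketch. The exact rules used in the study are not given, so the boundary condition below (sentence-final punctuation, whitespace, then a capital Latin, Russian or Kazakh letter) and the function name are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Boundary: ., ! or ? followed by whitespace and an uppercase letter
# (Latin, Russian, or a Kazakh-specific Cyrillic letter).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁӘҒҚҢӨҰҮҺІ])")


def split_sentences(text):
    """Split text into sentences, treating each line as a paragraph."""
    sentences = []
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if paragraph:
            sentences.extend(_BOUNDARY.split(paragraph))
    return sentences
```
]]></preformat>
      <p>A rule this simple will split incorrectly after abbreviations and initials, which is one reason why the resulting sentence counts of the two texts may differ.</p>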
      <p>Having studied this area in detail, we found that for languages belonging
to different language groups, and especially for specialized, thematic texts, the
dictionary method of alignment is the most suitable. On the basis of this conclusion, we
use the set of lexical compliances and the POS tags obtained in the
previous stages of preparation. The main reason why we could not use the first,
simpler approach to sentence alignment is the large difference between the syntax and semantics
of the Kazakh and Russian languages.</p>
      <p>As the set of lexical compliances we use our own Kazakh-Russian dictionary, which is
based on an English-Kazakh-Russian dictionary that contains about 50 000 entries.
Figure 2 shows a fragment of the English-Kazakh-Russian dictionary, which we
use as the background dictionary. Figure 3
shows a fragment of the Kazakh-Russian dictionary, which we apply to align the parallel
Kazakh-Russian corpus.</p>
      <p>To improve the sentence alignment, in addition to the dictionary method, we use
the POS tags of the words in the sentences. This
improves the results of the dictionary method, because Kazakh words often have several
translation variants. With correct tagging of the words in the texts, it
is possible to choose the best translation equivalent.</p>
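      <p>One way to combine a bilingual dictionary with POS tags is sketched below. The scoring function, the greedy matching, the threshold value and the data layout are all assumptions for illustration and are not the procedure used in the study; the sketch also assumes that words have already been reduced to dictionary forms.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def pair_score(kk_tokens, ru_tokens, dictionary):
    """Share of Kazakh tokens with a same-POS translation in the Russian sentence.

    kk_tokens, ru_tokens: lists of (word, pos) pairs.
    dictionary: maps a Kazakh word to a set of (russian_word, pos) pairs.
    """
    if not kk_tokens:
        return 0.0
    ru_set = set(ru_tokens)
    hits = 0
    for word, pos in kk_tokens:
        translations = dictionary.get(word, set())
        # A translation counts only if it occurs with the same POS tag.
        if any(tr in ru_set and tr[1] == pos for tr in translations):
            hits += 1
    return hits / len(kk_tokens)


def align(kk_sentences, ru_sentences, dictionary, threshold=0.5):
    """Greedily link each Kazakh sentence to its best-scoring Russian one."""
    pairs = []
    for i, kk in enumerate(kk_sentences):
        scores = [pair_score(kk, ru, dictionary) for ru in ru_sentences]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j))
    return pairs
```
]]></preformat>
      <p>Requiring the POS tags to agree filters out translation variants that belong to a different part of speech, which is the disambiguating effect described above.</p>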
    </sec>
    <sec id="sec-5">
      <title>Experiments and results</title>
      <p>Our parallel Kazakh-Russian corpus contains texts from the selected Kazakh news sites for
the period 2018-2019. The corpus includes more than 50410 words, about 24800 of
them in Russian and about 25600 in Kazakh.</p>
      <p>To assess the correctness of the aligned sentences, we rely on the
opinion of experts who are native speakers of both the Kazakh and the Russian
languages.</p>
      <p>A specially designed application allows the experts to choose a text in either
language (Russian or Kazakh) and automatically loads the parallel text file. When
working with the corpus, an expert can mark up texts, save them with the markup and align
them manually. Figure 4 shows the user interface of our application
for working with aligned parallel corpora.</p>
      <p>According to the assessment of at least three experts, about 60% of the
sentences in our parallel Kazakh-Russian corpus are correctly aligned automatically.
The rest of the sentences need to be aligned manually.</p>
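      <p>The text does not state how the judgements of the individual experts were combined. As one possible reading, the sketch below counts a sentence pair as correctly aligned when a strict majority of experts accepts it; the majority rule and the function name are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def alignment_accuracy(judgements):
    """Share of sentence pairs accepted by a strict majority of experts.

    judgements: one inner list per sentence pair, holding one boolean
    per expert (True means the expert judged the alignment correct).
    """
    if not judgements:
        return 0.0
    accepted = sum(1 for votes in judgements if 2 * sum(votes) > len(votes))
    return accepted / len(judgements)
```
]]></preformat>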
      <p>In our opinion, this percentage is explained by the following factors.
        <list list-type="order">
          <list-item><p>The numbers of sentences in the two language parts of the corpus do not fully coincide. Because of the complexity of Kazakh syntactic structures, some sentences do not correspond in structure to their Russian equivalents and are split into several sentences.</p></list-item>
          <list-item><p>The complexity of building the dictionary base. The dictionary method relies on a well-designed dictionary. Because the Russian and Kazakh languages belong to distant language groups, accurate translation is difficult when creating such a dictionary.</p></list-item>
          <list-item><p>The complexity and limited use of comparative grammar for the Kazakh and Russian languages in our study. Analysis and further development of this approach should improve the result of the alignment.</p></list-item>
          <list-item><p>The dictionary method does not consider proper nouns. Texts on the criminal subject contain a large number of such words, especially in headings.</p></list-item>
        </list>
      </p>
      <p>All these mismatches can significantly affect the results of alignment.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and further work</title>
      <p>Parallel corpora are used to address various challenges, such as developing
and tuning machine translation systems, comparative study of languages and
language teaching. The development of such corpora is particularly important for
low-resource languages and for pairs of languages from different groups, for example
Russian and Kazakh.</p>
      <p>The developed parallel Kazakh-Russian corpus is based on four
multilingual news websites of Kazakhstan, from which specialized criminal
information was selected. The corpus contains more than 50410 words, of which about 25600 are
Kazakh and about 24800 are Russian.</p>
      <p>The corpus is aligned using a specially configured dictionary and
the POS tags of both texts. Additionally, the corpus is provided with
a software application that allows specialized information to be added to the
corpus. According to the expert assessment, about 60% of the sentences are correctly aligned
by the automatic procedure. In the next phase of the study, we plan to classify and analyze the errors
connected with the alignment of the corpus. For that, we will involve a group of
philologists of the Kazakh and Russian languages in the professional analysis and
assessment of the results.</p>
      <p>The developed aligned Kazakh-Russian parallel corpus can be used as training data
for machine translation, for identification and extraction of texts connected with
crime, and for various other NLP tasks.</p>
      <p>The next step of our study is to make greater use of POS-tagging at the
alignment stage, which is currently limited by the complexity of full tagging
of Kazakh texts.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>This research is supported by the Committee of Science of the Ministry of Education
and Science of the Republic of Kazakhstan (project No. AP05131073 – Methods,
models of retrieval and analyses of criminal contained information in semi-structured
and unstructured textual arrays).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Rizun</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waloszek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Methodology for Text Classification using Manually Created Corpora-based Sentiment Dictionary</article-title>
          .
          <source>In: Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018) - Volume 1: KDIR</source>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>220</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Church</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          :
          <article-title>A program for aligning sentences in bilingual corpora</article-title>
          .
          <source>In: ACL'93 29th Annual Meeting</source>
          , vol.
          <volume>19</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>75</fpage>
          -
          <lpage>102</lpage>
          . USA (NJ) (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kilgarriff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baisa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bušta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakubíček</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovář</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michelfeit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rychlý</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchomel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>The Sketch Engine: Ten Years On</article-title>
          . In: Lexicography, pp.
          <fpage>7</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chapaev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turapbekov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Building Kazakh language open source corpora using wikipedia resources</article-title>
          . In: Suleyman Demirel University Bulletin, pp.
          <fpage>153</fpage>
          -
          <lpage>160</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Makhambetov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makazhanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yessenbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matkarimov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabyrgaliyev</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharafudinov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Assembling the Kazakh Language Corpus</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , pp.
          <fpage>1022</fpage>
          -
          <lpage>1033</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roscheisen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Text translation alignment</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>121</fpage>
          -
          <lpage>142</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McKeown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Aligning noisy parallel corpora across language groups: word pair feature matching by dynamic time warping</article-title>
          .
          <source>In: Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA-94)</source>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>88</lpage>
          . Columbia, Maryland, USA (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Simard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isabelle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using cognates to align sentences in bilingual corpora</article-title>
          .
          <source>In: Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 1992)</source>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>81</lpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Varga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halacsy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kornai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagy</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nemeth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tron</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Parallel corpora for medium density languages</article-title>
          .
          <source>Amsterdam Studies in the Theory and History of Linguistic Science, Series</source>
          <volume>4</volume>
          (
          <issue>292</issue>
          ),
          <fpage>247</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Iterative, MT-based Sentence Alignment of Parallel Texts</article-title>
          .
          <source>In: Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)</source>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters</source>
          , pp.
          <fpage>710</fpage>
          -
          <lpage>718</lpage>
          . Beijing, China (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Vondricka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Aligning parallel texts with InterText</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          , pp.
          <fpage>1875</fpage>
          -
          <lpage>1879</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhumanov</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madiyeva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rakhimova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>New Kazakh parallel text corpora with online access</article-title>
          .
          <source>In: Conference on Computational Collective Intelligence Technologies and Applications</source>
          , pp.
          <fpage>501</fpage>
          -
          <lpage>508</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rakhimova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhumanov</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Complex Technology of Machine Translation Resources Extension for the Kazakh Language</article-title>
          .
          <source>In: Advanced Topics in Intelligent Information and Database Systems</source>
          . Springer International Publishing, Almaty, Kazakhstan (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Grabar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanishcheva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Multilingual aligned corpus with Ukrainian as the target language</article-title>
          .
          <source>In: SLAVICORP</source>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>57</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Harme</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Last year but not yesterday? Explaining differences in the locations of Finnish and Russian time adverbials using comparable corpora</article-title>
          .
          <source>In: SLAVICORP</source>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>63</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quirk</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment</article-title>
          .
          <source>In: Proceedings of the Human Language Technologies/North American Association for Computational Linguistics</source>
          , pp.
          <fpage>403</fpage>
          -
          <lpage>411</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lewoniewski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Węcel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abramowicz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Quality and importance of Wikipedia articles in different languages</article-title>
          .
          <source>In: International Conference on Information and Software Technologies</source>
          , pp.
          <fpage>613</fpage>
          -
          <lpage>624</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>In search of the best method for sentence alignment in parallel texts</article-title>
          .
          <source>In: Computer treatment of Slavic and East European languages. Third international seminar</source>
          , pp.
          <fpage>174</fpage>
          -
          <lpage>185</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>