<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extended Language Modeling Experiments for Kazakh</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bagdat Myrzakhmetov</string-name>
          <email>bagdat.myrzakhmetov@nu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhanibek Kozhirbayev</string-name>
          <email>zhanibek.kozhirbayev@nu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Laboratory Astana, Nazarbayev University</institution>
          ,
          <addr-line>Astana, 010000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nazarbayev University, School of Science and Technology</institution>
          ,
          <addr-line>Astana, 010000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this article we present a dataset for language modeling in Kazakh. It is an analogue of the Penn Treebank dataset for the Kazakh language, as we followed the same instructions to create it. The main source for our dataset is articles from web pages that were originally written in Kazakh, since many news articles in Kazakhstan are translated into Kazakh. The dataset is publicly available for research purposes1. Several experiments were conducted with this dataset. Together with traditional n-gram models, we created neural network models for word-based language modeling (LM). The latter model, based on a large-parameter long short-term memory (LSTM) network, shows the best performance. Since Kazakh is an agglutinative language and might have a high out-of-vocabulary (OOV) rate on unseen data, we also carried out morph-based LM experiments. The experimental results show that sub-word based LMs fit Kazakh well in both n-gram and neural net models compared to word-based LMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Modeling</kwd>
        <kwd>Kazakh language</kwd>
        <kwd>n-gram</kwd>
        <kwd>neural language models</kwd>
        <kwd>morph-based models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The main task of a language model is to determine whether a particular sequence of words is appropriate in some context, i.e. whether the sequence should be accepted or discarded. Language models are used in various areas such as speech recognition, machine translation, handwriting recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], spelling correction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], augmentative communication [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Natural Language Processing tasks (part-of-speech tagging, natural language generation, word similarity, machine translation) [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. Depending on the task, strict rules may be required, in which case language models are created by humans and hand-constructed networks are used. However, developing rule-based approaches is difficult, and it requires costly human effort when large vocabularies are involved. The usefulness of this approach is also limited: in most cases (especially when a large vocabulary is used) rules are inflexible, and humans often produce ungrammatical sequences of words during speech. Moreover, as [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] states, in most cases the task of language modeling is “to predict how likely the
      </p>
      <sec id="sec-1-1">
        <title>1 https://github.com/Baghdat/LSTM-LM/tree/master/data/</title>
        <p>sequence of words is”, not to accept or reject it as in rule-based language modeling. For that reason, statistical probabilistic language models were developed.</p>
        <p>A large number of word sequences is required to create a language model. Therefore the language model should be able to assign probabilities not only to small groups of words, but also to whole sentences. Nowadays it’s possible to obtain large, readable text corpora consisting of millions of words, and language models can be created from such corpora.</p>
        <p>
          In this work, we first created the datasets for the language modeling experiments. We built an analogue of the Penn Treebank corpus for the Kazakh language, following the same preprocessing steps and corpus sizes. The Penn Treebank (PTB) Corpus [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a widely used dataset for language modeling in English. The PTB dataset originally contains one million words from the Wall Street Journal, a small portion of ATIS-3 material and the tagged Brown corpus. Later, [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] preprocessed this corpus, divided it into training, validation and test sets and restricted the vocabulary size to 10k words. Since then, this version of the PTB corpus has been widely used in state-of-the-art language modeling experiments. We made our dataset publicly available for any research purposes. Since there are not many open source corpora in Kazakh, we hope that this dataset will be useful to the research community.
        </p>
        <p>
          Various language modeling experiments were performed with our dataset. We first tried traditional n-gram based statistical models, and then performed state-of-the-art neural network based language modeling experiments using LSTM [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] cells. An LSTM-based neural network with a large number of parameters showed the best result. We evaluated our language models with the perplexity score, which is a widely used metric for evaluating language models intrinsically. As Kazakh is an agglutinative language, word-based language models might have a high portion of out-of-vocabulary (OOV) words on unseen data. For this reason, we also performed morpheme-based language modeling experiments. Sub-word based language models fit Kazakh well in both n-gram and neural net models compared to word-based language models.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Data preparation</title>
      <p>We collected the datasets from the websites using our own Python scripts, which use the BeautifulSoup and Requests libraries. The collected pages were parsed with these scripts on the basis of their HTML structure. The datasets were crawled from 4 web pages whose articles are originally written in Kazakh: egemen.kz, zhasalash.kz, anatili.kazgazeta.kz and baq.kz. These web pages mainly contain news articles and historical and literary texts. There are many official web pages in Kazakhstan belonging to state bodies and other quasi-governmental establishments where texts in Kazakh could be collected. However, in many cases these web pages provide articles translated from Russian: the news articles are first written in Russian and only then translated into Kazakh. Such datasets may not</p>
      <p>
        well reflect the nature of the Kazakh language, since translation changes the structure of sentences and the use of words. We rarely see the established phraseological units of Kazakh in these translated articles; instead, we see translated versions of phraseology from the other language. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] studied original and translated texts in machine translation, and found that translated texts can differ significantly from original texts. For this reason, we excluded the web pages that might contain translated texts and chose web pages whose texts are originally written in Kazakh. The statistics of the datasets are given in Table 1.
After collecting the datasets, we preprocessed them following [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. First, all collected texts were tokenized using the Moses [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] script; we added non-breaking prefixes for Kazakh in Moses so as not to split abbreviations. The next preprocessing steps were lowercasing and normalization of punctuation, after which we removed all punctuation signs. All digits were replaced by a special sign “N”. We removed all sentences shorter than 4 words or longer than 80 words, as well as duplicate sentences. After these operations, we restricted the vocabulary size to 10,000: we found the 10,000 most frequent words and replaced all words not in this list with ‘&lt;unk&gt;’.
      </p>
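      <p>As an illustration, the cleaning steps above (lowercasing, punctuation removal, digit replacement, length filtering, vocabulary restriction) can be sketched in Python. This is a simplified stand-in, not our actual pipeline, which used the Moses tokenizer scripts; the function name and parameters are ours.</p>

```python
import re
from collections import Counter

def preprocess(sentences, vocab_size=10000, min_len=4, max_len=80):
    """Sketch of the cleaning steps: lowercase, strip punctuation,
    map digits to "N", drop too-short/too-long and duplicate sentences,
    then keep only the vocab_size most frequent words ("<unk>" otherwise)."""
    cleaned, seen = [], set()
    for s in sentences:
        s = s.lower()
        s = re.sub(r"[^\w\s]", " ", s)   # remove punctuation signs
        s = re.sub(r"\d+", "N", s)       # replace digits with the sign "N"
        toks = s.split()
        if not (min_len <= len(toks) <= max_len):
            continue
        key = tuple(toks)
        if key in seen:                  # drop duplicate sentences
            continue
        seen.add(key)
        cleaned.append(toks)
    counts = Counter(t for toks in cleaned for t in toks)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    return [[t if t in vocab else "<unk>" for t in toks] for toks in cleaned]
```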
      <p>After preprocessing, we divided our datasets into training, validation and testing sets, trying to follow the sizes of the Penn Treebank corpus. Since our datasets were built from four sources whose contents differ (for example, egemen.kz contains mostly official news, while anatili.kazgazeta.kz contains mainly historical and literary articles), we avoided having one source only for training and the others only for testing or validation. For this reason, we split each source in the same proportions into training, validation and test sets, at the document level. The statistics of the training, validation and test sets are given in Table 2. Note that the overall sentence and word numbers might not be the sums of the columns, because we exclude repeated sentences. For size comparison, we also provide the statistics of the Penn Treebank corpus.</p>
      <sec id="sec-2-1">
        <title>Sources</title>
        <p>[Tables 1 and 2: per-source statistics for egemen.kz, zhasalash.kz, anatili.kazgazeta.kz and baq.kz, the overall totals, and the Penn Treebank dataset for comparison.]</p>
        <p>
          <bold>n-gram based models.</bold> The main idea behind language modeling is to predict hypothesized word sequences in a sentence with probabilistic models. N-gram models “predict the next word from the previous N-1 words”, an n-gram being an N-token sequence of words [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. For example, a two-gram model (more often called a bigram model) uses two-word sequences such as “Please do”, “do your”, “your homework”, and a three-gram (trigram) model consists of three-word sequences, and so on. As [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] states, an n-gram model computes the following word from the preceding ones. The n-gram idea can be formulated as: given the previous word sequence, find the probability of the next word. When computing the probabilities of word sequences, it is important to define the boundaries (punctuation marks such as the period, comma or colon, or the start of a new sentence on a new line) in order to prevent the search from becoming computationally unmanageable.
        </p>
        <p>Formulated mathematically, the goal of a language model is to find the probability of a word sequence, P(w1, …, wn), which can be estimated by the chain rule of probability theory:</p>
        <p>P(w1, …, wn) = P(w1) × P(w2|w1) × … × P(wn|w1, …, wn-1)
There is also the notion of history: for example, in P(w4|w1, w2, w3), the sequence (w1, w2, w3) is considered the history. These probabilities are estimated from frequencies.</p>
        <p>
          We can write the formulas for the bigram and trigram models as:
P(w1, …, wn) ≈ ∏i P(wi|wi-1)   (bigram)
P(w1, …, wn) ≈ ∏i P(wi|wi-2, wi-1)   (trigram)
This assumption helps to reduce the computation and allows probabilities to be estimated from a large corpus. The assumption that the probability of a word depends only on the previous n words (the previous two words for a trigram) is called a Markov assumption. A Markov model [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] assumes that it is possible to predict the probability of some future event without looking deeply into the past.
        </p>
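        <p>For concreteness, the bigram probabilities can be estimated from raw counts. A minimal maximum-likelihood sketch with a toy corpus (the function name and sentence markers are our own):</p>

```python
from collections import Counter

def bigram_model(corpus):
    """MLE bigram model: P(w2 | w1) = count(w1, w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]          # sentence boundary markers
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob
```

With the two toy sentences “please do your homework” and “please do it”, P(do | please) = 2/2 = 1 and P(your | do) = 1/2.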
        <p>
          Using the Markov assumption, we can find the probability of a sequence of words by the following formula:
P(w1, …, wn) ≈ ∏i P(wi|wi-n+1, …, wi-1)
Until recently, n-gram language models were widely used in all language modeling experiments. In Kazakh, n-gram based language models are still used in speech processing [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and machine translation [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] tasks. We trained n-gram models with the SRILM toolkit [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] using the add-0 smoothing technique. For our dataset, the modified Kneser-Ney [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and Katz backoff [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] algorithms showed poor results (543.63 perplexity on the test set), as many infrequent words were replaced by the ‘&lt;unk&gt;’ sign. The add-0 smoothing technique showed the best performance for the n-gram models. The results are given in Table 3.
        </p>
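        <p>Perplexity, the metric reported in Table 3, can be sketched for a bigram model with additive smoothing. This is a toy stand-in for the SRILM toolkit, not its implementation; the function name and the smoothing constant k are our own assumptions.</p>

```python
import math
from collections import Counter

def perplexity(train, test, k=0.001):
    """Perplexity of an add-k smoothed bigram model:
    PP = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))."""
    uni, bi, vocab = Counter(), Counter(), set()
    for sent in train:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        uni.update(toks[:-1])
        bi.update(zip(toks[:-1], toks[1:]))
    V = len(vocab)
    logp, n = 0.0, 0
    for sent in test:
        toks = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(toks[:-1], toks[1:]):
            # add-k smoothed conditional probability
            logp += math.log((bi[(w1, w2)] + k) / (uni[w1] + k * V))
            n += 1
    return math.exp(-logp / n)
```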
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Neural LSTM based models</title>
      <p>
        In this experiment, we built neural LSTM-based language models. There are many types of neural architectures that have also been applied successfully to language modeling tasks; starting from the work of [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], many recurrent neural architectures have been proposed. With Recurrent Neural Networks it is possible to model word sequences, as the recurrence allows the model to remember the previous word history.
      </p>
      <p>A Recurrent Neural Network can directly model the original conditional probabilities, P(w1, …, wn) = ∏i P(wi|w1, …, wi-1), which n-gram models approximate as ∏i P(wi|wi-1) for the bigram model and ∏i P(wi|wi-2, wi-1) for the trigram model.</p>
      <p>Here f can be any nonlinear function such as tanh or ReLU, and g can be a softmax function.</p>
      <p>
        In our work, we followed [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], who presented a simple regularization technique for Recurrent Neural Networks (RNNs) with LSTM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] units. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] proposed the dropout technique for regularizing neural networks, but this technique does not work well with RNNs, which tend to overfit in many tasks. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] showed that a correctly applied dropout technique for LSTMs can substantially reduce overfitting in various tasks. They tested their dropout technique on language modeling, speech recognition, machine translation and image caption generation tasks.
      </p>
      <p>In general, the LSTM gates’ equations are given as follows:
ft = σ(Wf[Ct-1, ht-1, xt] + bf)</p>
      <p>To model sequences, the function f is constructed via recursion: the initial condition is h0 = 0 and the recursion is ht = f(xt, ht-1). Here, ht is called the hidden state or memory, and it memorizes the history from x1 up to xt-1. Then the output is defined as a function of ht:
P(w1, …, wn) = gw(ht)</p>
      <p>Then the state values are computed using the above gates:
it = σ(Wi[Ct-1, ht-1, xt] + bi)
ot = σ(Wo[Ct, ht-1, xt] + bo)
gt = tanh(Wg[Ct-1, ht-1, xt] + bg)
ct = ft ⊙ ct-1 + it ⊙ gt
ht = ot ⊙ tanh(ct)</p>
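      <p>A single LSTM step corresponding to these equations can be sketched as follows. This sketch uses the common variant in which the gates see only [h_{t-1}, x_t] rather than the cell state; the weight layout and function names are our own.</p>

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: W maps [h_prev; x] to the stacked pre-activations
    of the gates i, f, o and the candidate g (4*H rows in total)."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2*H])         # forget gate
    o = sigmoid(z[2*H:3*H])       # output gate
    g = np.tanh(z[3*H:4*H])       # candidate cell update
    c = f * c_prev + i * g        # c_t = f ⊙ c_{t-1} + i ⊙ g
    h = o * np.tanh(c)            # h_t = o ⊙ tanh(c_t)
    return h, c
```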
      <p>
        The dropout method of [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] can be described as follows: the dropout operator corrupts the information carried by the units, which forces the intermediate computations to be more robust. At the same time, in order not to erase all the information from the units, dropout is applied so that the units can still remember events that occurred many time steps in the past.
      </p>
      <p>
        We implemented our LSTM-based neural network models2 using TensorFlow [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. We trained regularized LSTMs of three sizes: small, medium and large. The small model has two layers and is unrolled for 20 steps; the medium and large LSTMs have two layers and are unrolled for 35 steps. The hidden size differs across the three models: 200, 650 and 1500 for the small, medium and large models respectively.
      </p>
      <p>We initialize the hidden states to zero. We then use the final hidden states of the
current minibatch as the initial hidden state of the subsequent minibatch.</p>
      <p>Our experiments showed that LSTM-based neural language models outperform the n-gram based models. The large and medium LSTM models show better results than the n-gram add-0 smoothing method (note that the Kneser-Ney discounting method gave poor results for n-grams). Overall, neural language models give better performance for Kazakh. The results are given in Table 3.</p>
    </sec>
    <sec id="sec-4">
      <title>Sub-word based language models</title>
      <p>
        In the last section, we experimented with sub-word based language models. The Kazakh language, like other Turkic languages, is agglutinative: word forms are obtained by adding suffixes. This agglutinative nature may lead to a high rate of out-of-vocabulary (OOV) words on unseen data.
2 https://github.com/Baghdat/LSTM-LM
To solve this problem, depending on the characteristics of individual languages, different language model units have been proposed. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] studied different word representations, such as morphemes, word segmentation based on Byte Pair Encoding (BPE), characters and character trigrams. Byte Pair Encoding, proposed by [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], can effectively handle rare words in Neural Machine Translation; it iteratively replaces frequent pairs of characters with a single unused character. Their experiments showed that for fusional languages (Russian, Czech) and for agglutinative languages (Finnish, Turkish) character trigram models perform best. Also, [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] considered syllables as the unit of language models and tested them with different representational models (LSTM, CNN, summation). As they state, syllable-aware language models fail to outperform character-aware ones, but syllabification can reduce the training time and the number of parameters compared to character-aware language models.
      </p>
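      <p>As a concrete illustration of BPE, which is described above (our own segmentation used Morfessor, not BPE), the merge loop can be sketched as follows; the function names and toy vocabulary are ours.</p>

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    pair of symbols into a single new symbol."""
    vocab = Counter(tuple(w) for w in words)   # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, cnt in vocab.items():
            for pair in zip(sym, sym[1:]):
                pairs[pair] += cnt
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        new_vocab = Counter()
        for sym, cnt in vocab.items():
            new_vocab[merge_pair(sym, best)] += cnt
        vocab = new_vocab
    return merges

def merge_pair(sym, pair):
    """Replace every adjacent occurrence of `pair` in `sym`
    with the concatenated symbol."""
    out, i = [], 0
    while i < len(sym):
        if i + 1 < len(sym) and (sym[i], sym[i + 1]) == pair:
            out.append(sym[i] + sym[i + 1])
            i += 2
        else:
            out.append(sym[i])
            i += 1
    return tuple(out)
```

On the toy words “low”, “lower”, “lowest”, the first two merges are ("l", "o") and then ("lo", "w"), producing the shared sub-word “low”.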
      <p>
        Considering these facts, in this section we experimented with sub-word based models. Morfessor [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] is a widely used tool for splitting datasets into morpheme-like units; it has been used successfully for many agglutinative languages (Finnish, Turkish, Estonian). Since there is currently no syllabification tool for Kazakh, we used the Morfessor tool to split our datasets into morpheme-like units.
      </p>
      <p>After splitting the datasets, we performed language modeling experiments on the morpheme-like units. The results are given in Table 4. Looking at the results, we can say that splitting words into morpheme-like units is beneficial in terms of OOV rate and perplexity in both n-gram and neural net based models.</p>
      <p>In this work we created an analogue of the Penn Treebank corpus for the Kazakh language. To create the corpus, we followed all instructions for preprocessing and for the sizes of the training, validation and test sets. This dataset is publicly available for research purposes. We conducted language modeling experiments on this dataset using traditional n-gram models and LSTM-based neural networks. We also explored sub-word units for language modeling in Kazakh. Our experiments showed that neural models outperform the n-gram based models, and that splitting words into morpheme-like units has an advantage over word-based models. In the future, we are going to create a hyphenation tool for the Kazakh language, as Morfessor’s morpheme-like units are data-driven and sometimes incorrect.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work has been funded by the Nazarbayev University under the research grant
No129-2017/022-2017 and by the Committee of Science of the Ministry of Education
and Science of the Republic of Kazakhstan under the research grant AP05134272.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Russell</surname> <given-names>S.</given-names></string-name>
          and
          <string-name><surname>Norvig</surname> <given-names>P.</given-names></string-name>
          <article-title>Artificial Intelligence: A Modern Approach</article-title>
          (2nd Ed.).
          <source>Prentice Hall</source>
          .
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kukich</surname>
            <given-names>K.</given-names>
          </string-name>
          <article-title>Techniques for automatically correcting words in text</article-title>
          .
          <source>ACM Computing Surveys</source>
          .
          <year>1992</year>
          .
          <volume>24</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>377</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Newell</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langer</surname>
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hickey</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>The role of natural language processing in alternative and augmentative communication</article-title>
          .
          <source>Natural Language Engineering</source>
          .
          <year>1998</year>
          .
          <volume>4</volume>
          (
          <issue>1</issue>
          ). pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Church</surname>
            <given-names>K.W.</given-names>
          </string-name>
          <article-title>A stochastic parts program and noun phrase parser for unrestricted text</article-title>
          .
          <source>In Proceedings of the Second Conference on Applied Natural Language Processing</source>
          .
          <year>1988</year>
          . pp.
          <fpage>136</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Brown</surname>
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cocke</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name><surname>DellaPietra</surname> <given-names>S.A.</given-names></string-name>
          ,
          <string-name>
            <surname>DellaPietra</surname>
            <given-names>V.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jelinek</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercer</surname>
            <given-names>R.L.</given-names>
          </string-name>
          and
          <string-name><surname>Roossin</surname> <given-names>P.S.</given-names></string-name>
          <article-title>A statistical approach to machine translation</article-title>
          .
          <source>Computational Linguistics</source>
          .
          <year>1990</year>
          .
          <volume>16</volume>
          (
          <issue>2</issue>
          ). pp.
          <fpage>79</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hull</surname>
            <given-names>J.J.</given-names>
          </string-name>
          <article-title>Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm</article-title>
          .
          <source>In AAAI Symposium: Probabilistic Approaches to Natural Language</source>
          .
          <year>1992</year>
          . pp.
          <fpage>77</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Whittaker</surname>
            <given-names>E. W. D.</given-names>
          </string-name>
          <article-title>Statistical Language Modelling for Automatic Speech Recognition of Russian and English</article-title>
          .
          <source>PhD thesis</source>
          , Cambridge University, Cambridge.
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Marcus</surname>
            <given-names>M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcinkiewicz</surname>
            <given-names>M.A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Santorini</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>Building a large annotated corpus of English: The Penn Treebank</article-title>
          .
          <source>Computational linguistics</source>
          .
          <year>1993</year>
          .
          <volume>19</volume>
          (
          <issue>2</issue>
          ). pp.
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kombrink</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burget</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Černocký</surname>
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Khudanpur</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>Extensions of recurrent neural network language model</article-title>
          .
          <source>In Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE International Conference on.
          <year>2011</year>
          . pp.
          <fpage>5528</fpage>
          -
          <lpage>5531</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hochreiter</surname>
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name><surname>Schmidhuber</surname> <given-names>J.</given-names></string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          .
          <year>1997</year>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ). pp.
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lembersky</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ordan</surname>
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wintner S</surname>
          </string-name>
          .
          <article-title>Language models for machine translation: Original vs. translated texts</article-title>
          .
          <source>Computational Linguistics</source>
          .
          <year>2012</year>
          .
          <volume>38</volume>
          (
          <issue>4</issue>
          ). pp.
          <fpage>799</fpage>
          -
          <lpage>825</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Koehn</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Federico</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertoldi</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowan</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moran</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zens</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dyer</surname>
            <given-names>C.</given-names>
          </string-name>
          .
          <article-title>Moses: Open source toolkit for statistical machine translation</article-title>
          .
          <source>In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions</source>
          .
          <source>Association for Computational Linguistics</source>
          .
          <year>2007</year>
          . pp.
          <fpage>177</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jurafsky</surname>
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Martin</surname>
            <given-names>J. H.</given-names>
          </string-name>
          .
          <article-title>Speech and Language Processing (2nd Ed.)</article-title>
          .
          <source>Prentice Hall</source>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Markov</surname>
            <given-names>A. A.</given-names>
          </string-name>
          <article-title>Primer statisticheskogo issledovaniya nad tekstom “Evgeniya Onegina”, illyustriruyushchij svyaz' ispytanij v tsep' [Example of a statistical investigation of the text of “Eugene Onegin” illustrating the dependence between samples in a chain]</article-title>
          .
          <source>Izvestiya Akademii Nauk</source>
          .
          <year>1913</year>
          . pp.
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kozhirbayev</surname>
            <given-names>Zh</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karabalayeva</surname>
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yessenbayev</surname>
            <given-names>Zh.</given-names>
          </string-name>
          .
          <article-title>Spoken term detection for Kazakh language</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Computer Processing of Turkic Languages “TurkLang 2016”</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Myrzakhmetov</surname>
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Makazhanov</surname>
            <given-names>A.</given-names>
          </string-name>
          .
          <article-title>Initial Experiments on Russian to Kazakh SMT</article-title>
          .
          <source>Research in Computing Science</source>
          .
          <year>2017</year>
          . vol.
          <volume>117</volume>
          . pp.
          <fpage>153</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Stolcke</surname>
            <given-names>A.</given-names>
          </string-name>
          <article-title>SRILM - an extensible language modeling toolkit</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP)</source>
          .
          <year>2002</year>
          . pp.
          <fpage>901</fpage>
          -
          <lpage>904</lpage>
          . URL: http://www.speech.sri.com/projects/srilm/.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Kneser</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ney</surname>
            <given-names>H.</given-names>
          </string-name>
          .
          <article-title>Improved backing-off for m-gram language modeling</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          .
          <year>1995</year>
          . vol.
          <volume>1</volume>
          . pp.
          <fpage>181</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Katz</surname>
            <given-names>S. M.</given-names>
          </string-name>
          <article-title>Estimation of probabilities from sparse data for the language model component of a speech recognizer</article-title>
          .
          <source>IEEE Transactions on Acoustics, Speech and Signal Processing</source>
          .
          <year>1987</year>
          .
          <volume>35</volume>
          (
          <issue>3</issue>
          ). pp.
          <fpage>400</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Bengio</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ducharme</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincent</surname>
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jauvin</surname>
            <given-names>C.</given-names>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>Journal of machine learning research</source>
          .
          <year>2003</year>
          . pp.
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Zaremba</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vinyals</surname>
            <given-names>O.</given-names>
          </string-name>
          .
          <article-title>Recurrent neural network regularization</article-title>
          .
          <source>arXiv preprint arXiv:1409.2329</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Srivastava</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Salakhutdinov</surname>
            <given-names>R.</given-names>
          </string-name>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          .
          <year>2014</year>
          .
          <volume>15</volume>
          (
          <issue>1</issue>
          ). pp.
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Abadi</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>Zh.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kudlur</surname>
            <given-names>M.</given-names>
          </string-name>
          .
          <article-title>TensorFlow: a system for large-scale machine learning</article-title>
          .
          <source>In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation</source>
          .
          <source>USENIX Association</source>
          .
          <year>2016</year>
          . pp.
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Vania</surname>
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lopez</surname>
            <given-names>A.</given-names>
          </string-name>
          .
          <article-title>From Characters to Words to in Between: Do We Capture Morphology?</article-title>
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <year>2017</year>
          . Volume
          <volume>1</volume>
          : Long Papers
          , pp.
          <fpage>2016</fpage>
          -
          <lpage>2027</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Sennrich</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Birch</surname>
            <given-names>A.</given-names>
          </string-name>
          .
          <article-title>Neural Machine Translation of Rare Words with Subword Units</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <year>2016</year>
          . Volume
          <volume>1</volume>
          : Long Papers
          , pp.
          <fpage>1715</fpage>
          -
          <lpage>1725</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Assylbekov</surname>
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takhanov</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myrzakhmetov</surname>
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Washington</surname>
            <given-names>J. N.</given-names>
          </string-name>
          .
          <article-title>Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <year>2017</year>
          . pp.
          <fpage>1866</fpage>
          -
          <lpage>1872</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Smit</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Virpioja</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grönroos</surname>
            <given-names>S. A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kurimo</surname>
            <given-names>M.</given-names>
          </string-name>
          .
          <article-title>Morfessor 2.0: Toolkit for statistical morphological segmentation</article-title>
          .
          <source>In The 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, April 26-30</source>
          .
          <year>2014</year>
          . Aalto University.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>