<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Extended Language Modeling Experiments for Kazakh</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Bagdat</forename><surname>Myrzakhmetov</surname></persName>
							<email>bagdat.myrzakhmetov@nu.edu.kz</email>
							<affiliation key="aff0">
								<orgName type="laboratory">National Laboratory Astana</orgName>
								<orgName type="institution">Nazarbayev University</orgName>
								<address>
									<postCode>010000</postCode>
									<settlement>Astana</settlement>
									<country key="KZ">Kazakhstan</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">School of Science and Technology</orgName>
								<orgName type="institution">Nazarbayev University</orgName>
								<address>
									<postCode>010000</postCode>
									<settlement>Astana</settlement>
									<country key="KZ">Kazakhstan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zhanibek</forename><surname>Kozhirbayev</surname></persName>
							<email>zhanibek.kozhirbayev@nu.edu.kz</email>
							<affiliation key="aff0">
								<orgName type="laboratory">National Laboratory Astana</orgName>
								<orgName type="institution">Nazarbayev University</orgName>
								<address>
									<postCode>010000</postCode>
									<settlement>Astana</settlement>
									<country key="KZ">Kazakhstan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Extended Language Modeling Experiments for Kazakh</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C457C6F32A6DE7A1C2D434C6E51D42A3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Language Modeling</term>
					<term>Kazakh language</term>
					<term>n-gram</term>
					<term>neural language models</term>
					<term>morph-based models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this article we present a dataset for language modeling in Kazakh. It is an analogue of the Penn Treebank dataset for the Kazakh language, created by following the same preprocessing instructions. The main source for our dataset is articles from web pages that were originally written in Kazakh, since many news articles in Kazakhstan are translations into Kazakh. The dataset is publicly available for research purposes 1 . Several experiments were conducted with this dataset. Together with traditional n-gram models, we trained neural-network word-based language models (LMs). Among the latter, a large parameterized long short-term memory (LSTM) model shows the best performance. Since Kazakh is an agglutinative language and word-based models might have a high out-of-vocabulary (OOV) rate on unseen data, we also carried out morph-based LM experiments. The experimental results show that sub-word based LMs fit Kazakh well in both n-gram and neural models compared with word-based LMs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The main task of a language model is to determine whether a particular sequence of words is appropriate in some context, i.e., whether the sequence should be accepted or discarded. Language models are used in areas such as speech recognition, machine translation, handwriting recognition <ref type="bibr" target="#b0">[1]</ref>, spelling correction <ref type="bibr" target="#b1">[2]</ref>, augmentative communication <ref type="bibr" target="#b2">[3]</ref> and other Natural Language Processing tasks (part-of-speech tagging, natural language generation, word similarity, machine translation) <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. Depending on the task, strict rules may be required; in that case language models are created by humans and hand-constructed networks are used. However, developing rule-based approaches is difficult and requires costly human effort when large vocabularies are involved. The usefulness of this approach is also limited: in most cases (especially with a large vocabulary) the rules are inflexible, and speakers often produce ungrammatical word sequences in spontaneous speech. Moreover, as <ref type="bibr" target="#b6">[7]</ref> states, in most cases the task of language modeling is "to predict how likely the sequence of words is", not to reject or accept it as in rule-based language modeling. For these reasons, statistical probabilistic language models were developed.</p><p>A large number of word sequences is required to create a language model, and the model should be able to assign probabilities not only to short spans of words but to whole sentences. Nowadays it is possible to build large, clean text corpora consisting of millions of words, and language models can be trained on such corpora.</p><p>In this work, we first created datasets for language modeling experiments. We built an analogue of the Penn Treebank corpus for the Kazakh language, following the same preprocessing steps and corpus sizes. The Penn Treebank (PTB) corpus <ref type="bibr" target="#b7">[8]</ref> is a widely used dataset for language modeling in English. It originally contains one million words from the Wall Street Journal, a small portion of ATIS-3 material and the tagged Brown corpus. <ref type="bibr" target="#b8">[9]</ref> later preprocessed this corpus, divided it into training, validation and test sets, and restricted the vocabulary size to 10k words. Since then, this version of the PTB corpus has been widely used in state-of-the-art language modeling experiments. We made our dataset publicly available for any research purpose; since there are not many open-source corpora in Kazakh, we hope it will be useful to the research community.</p><p>We performed various language modeling experiments with our dataset. We first tried traditional n-gram based statistical models and then performed state-of-the-art neural network based experiments using LSTM <ref type="bibr" target="#b9">[10]</ref> cells. An LSTM-based neural network with a large number of parameters showed the best result. We evaluated our language models with the perplexity score, a widely used metric for evaluating language models intrinsically. As Kazakh is an agglutinative language, word-based language models might have a high portion of out-of-vocabulary (OOV) words on unseen data; for this reason, we also performed morpheme-based language modeling experiments. Sub-word based language models fit Kazakh well in both n-gram and neural models compared with word-based language models.</p></div>
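The perplexity metric used throughout this paper can be computed directly from per-token log-probabilities. A minimal sketch in Python (the function name is ours, not from the paper):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# A uniform model over a 10k-word vocabulary assigns each token
# probability 1/10000, so its perplexity is exactly 10000.
uniform = [math.log(1 / 10000)] * 50
print(round(perplexity(uniform)))  # 10000
```

Lower perplexity means the model is less "surprised" by the held-out text, which is why it is used below to compare the n-gram and neural models.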
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Data preparation</head><p>We collected the datasets from websites using custom Python scripts based on the BeautifulSoup and Requests libraries. The collected pages were parsed with our scripts on the basis of their HTML structure. The data were crawled from four web pages whose articles are originally written in Kazakh: egemen.kz, zhasalash.kz, anatili.kazgazeta.kz and baq.kz. These web pages mainly contain news articles and historical and literary texts. There are many official web pages in Kazakhstan belonging to state bodies and other quasi-governmental establishments where Kazakh texts could be collected. In many cases, however, these pages provide articles translated from Russian: the news articles are first written in Russian and only then translated into Kazakh. Such data might not reflect the internal nature of the Kazakh language well, since translation changes the structure of sentences and the use of words. We rarely see stable Kazakh phraseological units in these translated articles; instead we see translated versions of phraseological expressions from the other language. <ref type="bibr" target="#b10">[11]</ref> studied original and translated texts in machine translation and found that translated texts may differ significantly from original ones. For this reason, we excluded web pages likely to contain translated texts and chose web pages whose texts are originally written in Kazakh. The statistics of the datasets are given in Table <ref type="table" target="#tab_0">1</ref>. After collecting the data, we preprocessed it following <ref type="bibr" target="#b8">[9]</ref>. First, all collected texts were tokenized with the Moses <ref type="bibr" target="#b11">[12]</ref> script. We added non-breaking prefixes for Kazakh in Moses so as not to split abbreviations. The next preprocessing steps were lowercasing and punctuation normalization, after which we removed all punctuation signs. All digits were replaced by the special sign "N". We removed all sentences shorter than 4 or longer than 80 words, as well as duplicate sentences. After these operations, we restricted the vocabulary size to 10,000: we found the 10,000 most frequent words and replaced all words not in this list with '&lt;unk&gt;'. We then divided the data into training, validation and test sets, following the sizes of the Penn Treebank corpus. Since our data come from four sources whose contents differ (for example, egemen.kz contains mostly official news, while anatili.kazgazeta.kz contains mainly historical and literary articles), we avoided using one source only for training and others only for validation or testing; instead we split each source into the same proportions. The split into training, validation and test sets was done at the document level. The statistics of the resulting sets are given in Table <ref type="table" target="#tab_1">2</ref>. Note that the overall sentence and word counts may not equal the sum of the columns, because repeated sentences were excluded. For size comparison, we also provide the statistics of the Penn Treebank corpus. </p></div>
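The preprocessing steps described above (lowercasing, digit replacement, length filtering, deduplication and restriction to the 10,000 most frequent words) can be sketched as follows; this is a hypothetical re-implementation for illustration, not the authors' actual script:

```python
import re
from collections import Counter

def preprocess(sentences, vocab_size=10000):
    """Sketch of the pipeline: lowercase, map digit runs to 'N',
    drop sentences shorter than 4 or longer than 80 words,
    drop duplicates, then keep the vocab_size most frequent words
    and map everything else to '<unk>'."""
    cleaned, seen = [], set()
    for s in sentences:
        toks = re.sub(r"\d+", "N", s.lower()).split()
        key = " ".join(toks)
        if 4 <= len(toks) <= 80 and key not in seen:
            seen.add(key)
            cleaned.append(toks)
    counts = Counter(t for toks in cleaned for t in toks)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    return [[t if t in vocab else "<unk>" for t in toks] for toks in cleaned]

out = preprocess(["The CAT sat on 42 mats", "The CAT sat on 42 mats", "too short"])
print(out)  # [['the', 'cat', 'sat', 'on', 'N', 'mats']]
```

The duplicate sentence and the too-short sentence are removed, the casing is normalized, and the digits collapse to the "N" sign, mirroring the steps listed in the text.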
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">n-gram based models</head><p>The main idea behind language modeling is to predict hypothesized word sequences in a sentence with a probabilistic model. "N-gram models predict the next word from the previous N-1 words" <ref type="bibr" target="#b12">[13]</ref>; an n-gram is an N-token sequence of words. For example, a two-gram model (more often called a bigram model) operates on two-word sequences such as "Please do", "do your", "your homework"; a three-gram (trigram) model operates on three-word sequences, and so on. As <ref type="bibr" target="#b12">[13]</ref> states, an n-gram model computes the next word from the preceding ones: given the previous word sequence, find the probability of the next word. When computing the probabilities of word sequences, it is important to define the boundaries (punctuation marks such as the period, comma and colon, or the start of a new sentence on a new line) in order to keep the search computationally manageable. Formulated mathematically, the goal of a language model is to find the probability of a word sequence, P(w 1 , …, w n ), which can be decomposed by the chain rule of probability theory:</p><formula xml:id="formula_0">P(w 1 , …, w n ) = P(w 1 )×P(w 2 |w 1 )×…× P(w n |w 1 , …, w n-1 )<label>(1)</label></formula><p>There is a notion of history: for example, in P(w 4 |w 1 , w 2 , w 3 ), the sequence (w 1 , w 2 , w 3 ) is considered the history, and this probability is estimated from frequencies. The bigram and trigram approximations can be written as:</p><formula xml:id="formula_1">P(w i |w 1 ...w i−1 ) ≈ P(w i |w i−1 )<label>(2)</label></formula><formula xml:id="formula_2">P(w i |w 1 ...w i−1 ) ≈ P(w i |w i−2 w i−1 )<label>(3)</label></formula><p>This assumption reduces the computation and allows the probabilities to be estimated from a large corpus. The assumption that the probability of a word depends only on the previous n−1 words (the previous two words for a trigram) is called a Markov assumption. A Markov model <ref type="bibr" target="#b13">[14]</ref> assumes that it is possible to predict the probability of a future event without looking too deeply into the past. Using a Markov assumption, we can find the probability of a word sequence by the following formula:</p><formula xml:id="formula_3">P(w 1 , …, w n ) = ∏P(w i |w 1 ...w i−1 ) ≈ ∏P(w i |w i−1 )<label>(4)</label></formula><p>for the bigram model, and for the trigram model:</p><formula xml:id="formula_4">≈ ∏P(w i |w i−2 w i−1 )<label>(5)</label></formula><p>Until recently, n-gram language models were widely used in all language modeling experiments. In Kazakh, n-gram based language models are still used in speech processing <ref type="bibr" target="#b15">[15]</ref> and machine translation <ref type="bibr" target="#b16">[16]</ref> tasks. We trained n-gram models with the SRILM toolkit <ref type="bibr" target="#b17">[17]</ref> using the add-0 smoothing technique. For our dataset, the modified Kneser-Ney <ref type="bibr" target="#b18">[18]</ref> and Katz backoff <ref type="bibr" target="#b19">[19]</ref> algorithms showed poor results (543.63 perplexity on the test set), as many infrequent words are replaced by the '&lt;unk&gt;' sign and only higher-order models might work well. Add-0 smoothing showed the best performance for our n-gram models. The results are given in Table <ref type="table" target="#tab_2">3</ref>.</p></div>
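The maximum-likelihood bigram estimate of Eq. (2) is simply a ratio of counts; a small illustrative sketch (not the SRILM implementation):

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram estimates
    P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}),
    i.e. the unsmoothed form of Eq. (2)."""
    unigrams = Counter(tokens[:-1])            # contexts w_{i-1}
    bigrams = Counter(zip(tokens, tokens[1:])) # pairs (w_{i-1}, w_i)
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = bigram_probs("the cat sat on the mat".split())
print(probs[("the", "cat")])  # 0.5: "the" occurs twice, once followed by "cat"
```

Any bigram absent from training data gets zero probability under this estimate, which is why smoothing (add-0, Kneser-Ney, Katz backoff) is needed in practice.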
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Neural LSTM based models</head><p>In this experiment, we built neural LSTM-based language models. Many types of neural architectures have been applied successfully to language modeling tasks, and starting from the work of <ref type="bibr" target="#b20">[20]</ref> many recurrent neural architectures have been proposed. Recurrent Neural Networks can model word sequences directly, as the recurrence allows them to remember the previous word history.</p><p>A Recurrent Neural Network directly models the original conditional probabilities:</p><formula xml:id="formula_5">P(w 1 , …, w n ) = ∏P(w i |w 1 ...w i−1 )<label>(6)</label></formula><p>To model sequences, the function f is constructed via recursion: the initial condition is h 0 = 0 and the recursion is h t = f(x t , h t−1 ). Here h t is called the hidden state or memory; it memorizes the history from x 1 up to x t−1 . The output distribution is then defined by applying an output function to h t :</p><formula xml:id="formula_6">P(w t |w 1 ...w t−1 ) = g w (h t ) (7)</formula><p>f can be any nonlinear function such as tanh or ReLU, and g can be a softmax function. In our work, we followed <ref type="bibr" target="#b21">[21]</ref>, who presented a simple regularization technique for Recurrent Neural Networks (RNNs) with LSTM <ref type="bibr" target="#b9">[10]</ref> units. <ref type="bibr" target="#b22">[22]</ref> proposed the dropout technique for regularizing neural networks, but it does not work well with RNNs, which without proper regularization tend to overfit in many tasks. <ref type="bibr" target="#b21">[21]</ref> showed that a correctly applied dropout technique for LSTMs can substantially reduce overfitting in various tasks. 
They tested their dropout technique on language modeling, speech recognition, machine translation and image caption generation tasks.</p><p>In general, the LSTM gate equations are given as follows:</p><formula xml:id="formula_7">f t = σ(W f [C t-1 , h t-1 , x t ]+b f )<label>(8)</label></formula><formula xml:id="formula_9">i t = σ(W i [C t-1 , h t-1 , x t ]+b i )<label>(9)</label></formula><formula xml:id="formula_11">o t = σ(W o [C t , h t-1 , x t ]+b o )<label>(10)</label></formula><formula xml:id="formula_13">g t = tanh(W g [C t , h t-1 , x t ]+b g )<label>(11)</label></formula><p>Then the state values are computed using the above gates:</p><formula xml:id="formula_14">c l t = f t ⊙ c l t-1 + i t ⊙ g t (12) h l t = o t ⊙ tanh(c l t )<label>(13)</label></formula><p>The dropout method of <ref type="bibr" target="#b21">[21]</ref> can be described as follows: the dropout operator corrupts the information carried by the units, forcing the intermediate computations to be more robust; at the same time, in order not to erase all information, the units can still remember events that occurred many time steps in the past. We implemented our<ref type="foot" target="#foot_1">2</ref> LSTM based neural network models using TensorFlow <ref type="bibr" target="#b23">[23]</ref>. We trained regularized LSTMs of three sizes: small, medium and large. The small model has two layers and is unrolled for 20 steps; the medium and large LSTMs have two layers and are unrolled for 35 steps. The hidden size differs across the three models: 200, 650 and 1500 for the small, medium and large models respectively. We initialize the hidden states to zero. 
We then use the final hidden states of the current minibatch as the initial hidden states of the subsequent minibatch.</p><p>Our experiments showed that LSTM-based neural language models outperform the n-gram based models. The large and medium LSTM models show better results than the n-gram add-0 smoothing method (note that the n-gram Kneser-Ney discounting method gave poor results). Thus, neural language models perform better for Kazakh. The results are given in Table <ref type="table" target="#tab_2">3</ref>. </p></div>
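A single LSTM step corresponding to the gate and state equations above can be sketched in NumPy. This sketch uses the common formulation in which the gates see only [h t-1 , x t ] (omitting the cell-state term that appears in the equations above); the packing of the four gate parameters into one matrix is our own assumption, not the paper's TensorFlow code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H+D) and b shape (4H,),
    stacking the forget/input/output/candidate parameters."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])        # forget gate, cf. Eq. (8)
    i = sigmoid(z[H:2*H])      # input gate, cf. Eq. (9)
    o = sigmoid(z[2*H:3*H])    # output gate, cf. Eq. (10)
    g = np.tanh(z[3*H:4*H])    # candidate values, cf. Eq. (11)
    c = f * c_prev + i * g     # new cell state, cf. Eq. (12)
    h = o * np.tanh(c)         # new hidden state, cf. Eq. (13)
    return h, c
```

Unrolling this step 20 or 35 times and carrying (h, c) across minibatches gives the truncated-backpropagation setup described in the text.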
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Sub-word based language models</head><p>In the last section, we experimented with sub-word based language models. The Kazakh language, like other Turkic languages, is agglutinative: word forms are obtained by adding suffixes. This agglutinative nature may lead to a high rate of out-of-vocabulary (OOV) words on unseen data. To address this problem, different language model units have been proposed, depending on the characteristics of individual languages. <ref type="bibr" target="#b24">[24]</ref> studied different word representations, such as morphemes, word segmentation based on Byte Pair Encoding (BPE), characters and character trigrams. Byte Pair Encoding, proposed by <ref type="bibr" target="#b25">[25]</ref>, can effectively handle rare words in Neural Machine Translation; it iteratively replaces the most frequent pairs of characters with a single unused character. Their experiments showed that for fusional languages (Russian, Czech) and for agglutinative languages (Finnish, Turkish) character trigram models perform best. <ref type="bibr" target="#b26">[26]</ref> considered syllables as the unit of the language model and tested them with different representational models (LSTM, CNN, summation). As they state, syllable-aware language models fail to outperform character-aware ones, but syllabification can reduce the training time and the number of parameters compared to character-aware language models. Considering these facts, in this section we experimented with sub-word based models. Morfessor <ref type="bibr" target="#b27">[27]</ref> is a widely used tool for splitting datasets into morpheme-like units. It has been used successfully for many agglutinative languages (Finnish, Turkish, Estonian). 
Since there is currently no syllabification tool for Kazakh, we used the Morfessor tool to split our datasets into morpheme-like units.</p><p>After splitting the datasets, we performed language modeling experiments on the morpheme-like units. The results are given in Table <ref type="table" target="#tab_3">4</ref>. The results show that splitting words into morpheme-like units is beneficial in terms of OOV rate and perplexity in both n-gram and neural models. </p></div>
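The OOV argument above is easy to make concrete: a word-level vocabulary misses unseen inflected forms, while their morpheme-like pieces are often already known. A small sketch (the Kazakh-like segmentations are illustrative only, not Morfessor output):

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data."""
    vocab = set(train_tokens)
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

# Word level: the unseen inflected form "kitaptarda" ("in the books") is OOV.
print(oov_rate(["kitap", "kitaptar"], ["kitaptarda"]))     # 1.0
# Morph level: after splitting, only the new suffix "da" is unseen.
print(oov_rate(["kitap", "tar"], ["kitap", "tar", "da"]))  # ~0.33
```

Each new suffix combination creates a new word type but reuses known morphemes, which is why the morph-based models in Table 4 fare better on unseen data.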
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this work we created an analogue of the Penn Treebank corpus for the Kazakh language. To create the corpus, we followed all the preprocessing instructions and the sizes of the training, validation and test sets. The dataset is publicly available for research purposes. We conducted language modeling experiments on this dataset using traditional n-gram models and LSTM based neural networks, and we also explored sub-word units for Kazakh language modeling. Our experiments showed that neural models outperform the n-gram based models and that splitting words into morpheme-like units has an advantage over word-based models. In the future, we plan to create a syllabification tool for the Kazakh language, since Morfessor's morpheme-like units are data-driven and sometimes incorrect.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Statistics of the dataset: train, validation and test sets shown separately for each source.</figDesc><table><row><cell>Sources</cell><cell># of documents</cell><cell># of sentences</cell><cell># of words</cell></row><row><cell>egemen.kz</cell><cell>950/80/71</cell><cell>21751/1551/1839</cell><cell>306415/22452/26790</cell></row><row><cell>zhasalash.kz</cell><cell>1126/83/95</cell><cell>8663/694/751</cell><cell>102767/8188/9130</cell></row><row><cell>anatili.kazgazeta.kz</cell><cell>438/32/37</cell><cell>23668/1872/2138</cell><cell>311590/23703/27936</cell></row><row><cell>baq.kz</cell><cell>752/72/74</cell><cell>13899/1082/1190</cell><cell>168062/13251/14915</cell></row><row><cell>Overall</cell><cell>3266/267/277</cell><cell>67981/5199/5918</cell><cell>886872/67567/78742</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Statistics of the training, validation and test sets (number of words).</figDesc><table><row><cell>Sources</cell><cell>Train set</cell><cell>Validation set</cell><cell>Test set</cell></row><row><cell>egemen.kz</cell><cell>306415</cell><cell>22452</cell><cell>26790</cell></row><row><cell>zhasalash.kz</cell><cell>102767</cell><cell>8188</cell><cell>9130</cell></row><row><cell>anatili.kazgazeta.kz</cell><cell>311590</cell><cell>23703</cell><cell>27936</cell></row><row><cell>baq.kz</cell><cell>168062</cell><cell>13251</cell><cell>14915</cell></row><row><cell>Overall</cell><cell>886872</cell><cell>67567</cell><cell>78742</cell></row><row><cell>Penn Tree Bank dataset</cell><cell>887521</cell><cell>70390</cell><cell>78669</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Word-based language modeling results.</figDesc><table><row><cell></cell><cell>n-gram</cell><cell></cell><cell>Neural LM</cell><cell></cell></row><row><cell></cell><cell></cell><cell>small</cell><cell>medium</cell><cell>large</cell></row><row><cell>Train ppl</cell><cell>93.81</cell><cell>68.522</cell><cell>67.741</cell><cell>63.185</cell></row><row><cell>Validation ppl</cell><cell>129.6537</cell><cell>143.871</cell><cell>118.875</cell><cell>113.944</cell></row><row><cell>Test ppl</cell><cell>123.7189</cell><cell>144.939</cell><cell>118.783</cell><cell>115.491</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>Morph-based language modeling results.</figDesc><table><row><cell></cell><cell>n-gram</cell><cell></cell><cell>Neural LM</cell><cell></cell></row><row><cell></cell><cell></cell><cell>small</cell><cell>medium</cell><cell>large</cell></row><row><cell>Train ppl</cell><cell>32.39255</cell><cell>19.599</cell><cell>24.999</cell><cell>25.880</cell></row><row><cell>Validation ppl</cell><cell>44.11561</cell><cell>50.904</cell><cell>41.896</cell><cell>40.876</cell></row><row><cell>Test ppl</cell><cell>44.39559</cell><cell>47.854</cell><cell>38.180</cell><cell>37.556</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/Baghdat/LSTM-LM/tree/master/data/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/Baghdat/LSTM-LM</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>This work has been funded by the Nazarbayev University under the research grant No129-2017/022-2017 and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<title level="m">Artificial Intelligence: A Modern Approach</title>
				<imprint>
			<publisher>Pretice Hall</publisher>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
	<note>2nd Ed</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Techniques for automatically correcting words in text</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kukich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="377" to="439" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The role of natural language processing in alternative and augmentative communication</title>
		<author>
			<persName><forename type="first">A</forename><surname>Newell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Langer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hickey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A stochastic parts program and noun phrase parser for unrestricted text</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Church</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Conference on Applied Natural Language Processing</title>
				<meeting>the Second Conference on Applied Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="1988">1988</date>
			<biblScope unit="page" from="136" to="143" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A statistical approach to machine translation</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cocke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Dellapietra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">J</forename><surname>Dellapietra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Jelinek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Mercer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Roossin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="79" to="85" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Hull</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI Symposium: Probabilistic Approaches to Natural Language</title>
				<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="77" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Statistical Language Modelling for Automatic Speech Recognition of Russian and English</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">W D</forename><surname>Whittaker</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Cambridge University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Building a large annotated corpus of English: The penn Treebank</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Marcinkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Santorini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="313" to="330" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Extensions of recurrent neural network language model</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kombrink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Burget</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Černocký</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khudanpur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="5528" to="5531" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Language models for machine translation: Original vs. translated texts</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lembersky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wintner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="799" to="825" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Moses: Open source toolkit for statistical machine translation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hoang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Birch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Federico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bertoldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cowan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Moran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions</title>
				<meeting>the 45th annual meeting of the ACL on interactive poster and demonstration sessions</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="177" to="180" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Speech and Language Processing</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Martin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Prentice Hall</publisher>
		</imprint>
	</monogr>
	<note>2nd Ed</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Primer statisticheskogo issledovaniya nad tekstom &quot;Evgeniya Onegina&quot;, illyustriruyushchij svyaz&apos; ispytanij v tsep&apos;</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Markov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Example of a statistical investigation of the text of &quot;Evgenii Onegin&quot; illustrating the connection of trials in a chain</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="j">Izvestiya Akademii Nauk</title>
		<imprint>
			<biblScope unit="page" from="153" to="162" />
			<date type="published" when="1913">1913</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Spoken term detection for Kazakh language</title>
		<author>
			<persName><forename type="first">Zh</forename><surname>Kozhirbayev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Karabalayeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zh</forename><surname>Yessenbayev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th International Conference on Computer Processing of Turkic Languages &quot;TurkLang 2016&quot;</title>
				<meeting>the 4th International Conference on Computer Processing of Turkic Languages &quot;TurkLang 2016&quot;</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="47" to="52" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Initial Experiments on Russian to Kazakh SMT</title>
		<author>
			<persName><forename type="first">B</forename><surname>Myrzakhmetov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Makazhanov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Research in Computing Science</title>
		<imprint>
			<biblScope unit="volume">117</biblScope>
			<biblScope unit="page" from="153" to="160" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">SRILM - an extensible language modeling toolkit</title>
		<author>
			<persName><forename type="first">A</forename><surname>Stolcke</surname></persName>
		</author>
		<ptr target="http://www.speech.sri.com/projects/srilm/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP)</title>
				<meeting>the 7th International Conference on Spoken Language Processing (ICSLP)</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="901" to="904" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Improved backing-off for m-gram language modeling</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kneser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing</title>
				<meeting>the IEEE International Conference on Acoustics, Speech and Signal Processing</meeting>
		<imprint>
			<date type="published" when="1995">1995</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="181" to="184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Estimation of probabilities from sparse data for the language model component of a speech recognizer</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Katz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Acoustics, Speech and Signal Processing</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="400" to="401" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A neural probabilistic language model</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ducharme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vincent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jauvin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1137" to="1155" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Recurrent neural network regularization</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zaremba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.2329</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Dropout: a simple way to prevent neural networks from overfitting</title>
		<author>
			<persName><forename type="first">N</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Tensorflow: a system for large-scale machine learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zh</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Devin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Irving</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Isard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kudlur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation</title>
				<meeting>the 12th USENIX conference on Operating Systems Design and Implementation</meeting>
		<imprint>
			<publisher>USENIX Association</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="265" to="283" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">From Characters to Words to in Between: Do We Capture Morphology?</title>
		<author>
			<persName><forename type="first">C</forename><surname>Vania</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lopez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 55th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2016" to="2027" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Neural Machine Translation of Rare Words with Subword Units</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sennrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Haddow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Birch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 54th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1715" to="1725" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Assylbekov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Takhanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Myrzakhmetov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N</forename><surname>Washington</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2017 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1866" to="1872" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Morfessor 2.0: Toolkit for statistical morphological segmentation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Smit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Virpioja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Grönroos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kurimo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL)</title>
				<meeting><address><addrLine>Gothenburg, Sweden</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">April 26-30, 2014</date>
		</imprint>
		<respStmt>
			<orgName>Aalto University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
