<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Kazakh Text Normalization using Machine Translation Approaches</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Laboratory Astana, Nazarbayev University</institution>
          ,
          <addr-line>Nur-Sultan</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We present our work on text normalization applied to user-generated content (UGC) in the Kazakh language, collected from the Kazakhstani segment of the Internet. UGC is notoriously difficult to process due to the rapid introduction of neologisms, idiosyncratic spelling, code-switching and transliteration. All of these increase lexical variety, thereby aggravating some of the most prominent problems of NLP, such as out-of-vocabulary lexica and data sparseness. It has been shown that a certain preprocessing step, known as lexical normalization or simply normalization, is required for NLP tools to work properly. We applied machine translation techniques to normalize Kazakh texts. For this, a parallel corpus was created with a set of aligned sentences in canonical and non-canonical forms. Using this corpus, we built a phrase-based statistical machine translation system as a baseline. Furthermore, we applied a word-based sequence-to-sequence model to the normalization task. The former method achieves a BLEU score of 21.67 on the test set, whereas the latter obtains approximately 30.</p>
      </abstract>
      <kwd-group>
        <kwd>Text normalization</kwd>
        <kwd>User-generated content</kwd>
<kwd>Sequence-to-sequence model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
With the rise of social media, user-generated text data has reached unprecedented
volumes. As part of the KazNLP project on developing tools and algorithms for
processing the Kazakh language [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
], we strive to provide tools for
processing real-world data, including user-generated content (UGC). UGC generally
refers to any type of content created by Internet users, including tweets, comments,
dialogues on Internet forums, etc. This type of text is considered difficult to process
due to its high level of noise, i.e. it is far from the standards of the literary language.
The Kazakhstani segment of the Internet is no exception, and the following
cases are the usual suspects in wreaking the “spelling mayhem” [
        <xref ref-type="bibr" rid="ref2">2</xref>
]:
─ code-switching – the use of Russian words and expressions in Kazakh text, and vice
versa;
─ word transformations, e.g. “керемееет” or “крмт” instead of “керемет” (great), or
segmentation of words, e.g. “к-е-р-е-м-е-т”;
─ the use of emoji and their symbolic counterparts, e.g. “:)”, “:(”.
The normalization tool is designed to edit such texts to match the standard language.
All these properties of UGC significantly reduce the accuracy of NLP tools, so in
practice UGC is often normalized, that is, brought to literary language standards.
Consequently, non-canonical text normalization is considered a key preprocessing
stage for almost all NLP tasks [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7 ref8">3–8</xref>
        ].
      </p>
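As an illustration, some of the word-transformation noise listed above can be handled with simple surface rules. The following is a minimal sketch (illustrative only, not the actual KazNLP implementation) that collapses elongated characters and joins dash-segmented words:

```python
import re

def collapse_elongation(token: str) -> str:
    """Collapse runs of 3+ identical characters: 'керемееет' -> 'керемет'."""
    return re.sub(r'(.)\1{2,}', r'\1', token)

def join_segmented(token: str) -> str:
    """Join dash-based in-word segmentation: 'к-е-р-е-м-е-т' -> 'керемет'."""
    parts = token.split('-')
    # Only join when the token is a chain of single characters,
    # so ordinary hyphenated words are left untouched.
    if len(parts) > 2 and all(len(p) == 1 for p in parts):
        return ''.join(parts)
    return token

print(collapse_elongation('керемееет'))  # керемет
print(join_segmented('к-е-р-е-м-е-т'))   # керемет
```

Rules like these only cover a fraction of the phenomena, which is why the paper treats normalization as a translation problem instead.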
<p>This paper is organized as follows: Section 2 presents an overview of various
techniques for the text normalization task. Data collection and annotation are presented in
Section 3. The method description and the results of the conducted experiments are
given in Section 4. A summary and conclusions of the performed experiments and
directions for further research are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        With the rapid growth of content on social media, text normalization has gained
increasing attention in the past decade, with a focus on converting noisy non-standard
tokens in informal text into standard vocabulary words. Spell checking plays an
important role in this process as it can be seen as an initial attempt at text normalization.
In [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9–11</xref>
], it was proposed to use a noisy-channel framework to generate a list of
corrections for each misspelled word, ranked according to the corresponding posterior
probabilities.
      </p>
      <p>
        The work of [
        <xref ref-type="bibr" rid="ref12">12</xref>
] refined this framework by computing the likelihood that a
noisy token and its associated tag would be generated by a specific word. However,
spell-checking algorithms are in most cases ineffective for this type of data because
they do not account for the phenomena of informal text. For example, some previous
work [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] has instead focused on sporadic typographical errors using edit distance
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in conjunction with modeling pronunciation.
      </p>
      <p>
        The work [
        <xref ref-type="bibr" rid="ref15">15</xref>
] used a noisy channel model based on spelling edit distance, using the
web to generate a large set of automatically gathered (noisy) pairs for
training and for spelling suggestions. Although they use the Web for data collection,
they focus not on informal text but on unintentional misspellings. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
combined the noisy channel model with a rule-based final transformer and obtained
acceptable results for French SMS. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] used weighted finite state machines (FSM) and
rewrite rules to normalize French SMS; [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] focused on tweets generated with mobile
phones and developed a CRF tagger for deletion-based reduction.
      </p>
      <p>
        Recent work has also focused on normalizing Twitter messages, which is generally
considered a more challenging task. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] developed classifiers to detect malformed
words and generated corrections based on morphophonic similarities. [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ]
proposed to normalize non-standard tokens without explicitly categorizing them.
      </p>
      <p>
The above approaches rely heavily on external linguistic resources and
manually defined rules. Neural networks have shown promising results on a wide
range of NLP tasks. Encoder-decoder architectures [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ] exceeded expectations in
machine translation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], dialogue generation [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], summarization [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], question
answering [
        <xref ref-type="bibr" rid="ref27">27</xref>
]. Hence, it is natural to ask whether Seq2Seq models are suitable
for the normalization task.
      </p>
      <p>
The work of [
        <xref ref-type="bibr" rid="ref28">28</xref>
] applied an encoder-decoder architecture using a Recurrent Neural
Network (RNN) based on Gated Recurrent Units (GRU) for Japanese text
normalization. They improved the performance of Japanese text normalization by
performing stable training of the encoder-decoder model with a new method for data
augmentation. [
        <xref ref-type="bibr" rid="ref29">29</xref>
] applied a normalization method based on a word-character attention-based
encoder-decoder model to noisy social media text. They state that the
character-based component, which is trained on synthetic adversarial examples,
yields a significant improvement. [
        <xref ref-type="bibr" rid="ref30">30</xref>
] normalized Swiss German WhatsApp messages using
an encoder-decoder model. They argue that the flexibility of the encoder-decoder
model allows using the same training data in different ways. In particular, the
decoding part was modified by introducing different levels of
granularity in the target-side language: characters and words. [
        <xref ref-type="bibr" rid="ref31">31</xref>
] explored the
possibility of using machine translation techniques to normalize noisy Turkish texts.
They trained a character-based translation model with synthetic parallel data.
Experiments were conducted with both statistical and neural machine translation
methods to compare the results.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data collection and annotation</title>
<p>Like most machine learning models, machine translation methods require training
data to produce meaningful results. A parallel text corpus is a structured set of
texts translated between two languages. Such parallel corpora are essential for training
machine translation algorithms. In our case, the source side consists of unprocessed
comments, and the target side of the comments revised by annotators. To create a corpus
for this task, we first collected comments from the news web pages nur.kz,
tengrinews.kz and zakon.kz. The comments were divided into language groups:
Kazakh, Russian and mixed. Comments that contain no errors were
removed, since there is no point in feeding already-correct comments to our machine
translation approach. To filter out such ideal comments, we used texts from the official
news web pages, which we assume to be error-free: if all the words of a
comment appear in these texts, the comment is considered ideal. Some comments
may contain multiple sentences, so we split longer comments into individual sentences.</p>
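The ideal-comment filtering described above can be sketched as follows; the tokenization and the name `reference_vocab` (a word set built from the official news texts) are simplified assumptions:

```python
def is_ideal(comment: str, reference_vocab: set) -> bool:
    """A comment is 'ideal' if every word occurs in the error-free reference texts."""
    return all(word in reference_vocab for word in comment.lower().split())

# Hypothetical example: vocabulary gathered from official news pages.
vocab = {"бұл", "керемет", "жаңалық"}
comments = ["бұл керемет жаңалық", "бұл крмт жаңалық"]

# Keep only comments that actually need normalization.
noisy = [c for c in comments if not is_ideal(c, vocab)]
print(noisy)  # ['бұл крмт жаңалық']
```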
<p>The statistics are shown in Table 1.
Besides comments from news portals, additional comments were collected from social
networks. The Kazakh-speaking audience of commentators is most active on social
networks such as Facebook and Instagram. Our analysis shows that the main share of
comments in the Kazakh language falls on Facebook groups such as OnlineQazaqstan
(367,920 members at the time of analysis) and the newspaper «Қала мен Дала»
(97,685 members at the time of analysis). Based on the above, the Kazakh-speaking
segment of Facebook was selected as the source for collecting text
data. The statistics of the dataset from social media are presented in Table 2.
After collecting the comments, we built a parallel corpus. The source side of the corpus
contains the comments, and the target side contains the revised version of the
corresponding comment. A web interface was built to correct comments and make
annotation easier. We had two annotators and one final controller-moderator. Fig. 1
shows a screenshot of the web interface. Here, annotators can select the source of the
correction and can also track the overall progress. The controller-moderator can
correct the work of the annotators and approve it.
After the annotation process, the datasets were further processed. Some very long
comments were split into several parts, mostly by sentence. Comments in Russian were
removed. After this preprocessing, 27005 comments remained. We used 90% of
these comments for training and the rest for testing. Statistics are shown in Table 3.</p>
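The 90/10 split can be done with a deterministic shuffle, for example (a sketch; in practice any standard splitting utility works):

```python
import random

def train_test_split(pairs, test_ratio=0.1, seed=42):
    """Shuffle parallel (noisy, clean) pairs and hold out a test fraction."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_ratio))
    return pairs[:cut], pairs[cut:]

# Placeholder pairs standing in for the 27005 annotated comments.
pairs = [(f"src{i}", f"tgt{i}") for i in range(27005)]
train, test = train_test_split(pairs)
print(len(train), len(test))  # 24304 2701
```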
    </sec>
    <sec id="sec-4">
      <title>Method description and results</title>
<p>In this project, we explored the potential of machine translation methods for
normalizing non-canonical texts in Kazakh. We conducted experiments with both
statistical machine translation (SMT) and neural machine translation (NMT)
approaches in order to compare the results.</p>
<p>The SMT method was chosen as the baseline experiment. A fairly standard set of
tools was used in this pipeline. Since the plan was to build scalable NLP tools in
Python, we implemented a phrase-based statistical machine translation system, as
phrase-based methods have shown high performance among the various approaches.
We used n-gram language models, in particular 3-gram models. The decoding process
was implemented using a beam-search stack decoding algorithm.</p>
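The 3-gram language model component scores word sequences from counts. A minimal add-one-smoothed sketch of the idea (illustrative only; the actual pipeline used standard SMT tooling):

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Count trigrams and their bigram contexts over tokenized sentences."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(toks) - 2):
            tri[tuple(toks[i:i + 3])] += 1
            bi[tuple(toks[i:i + 2])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3, vocab_size):
    """P(w3 | w1 w2) with add-one (Laplace) smoothing."""
    return (tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + vocab_size)

tri, bi = train_trigram_lm([["бұл", "керемет", "жаңалық"]])
p = trigram_prob(tri, bi, "бұл", "керемет", "жаңалық", vocab_size=5)
print(round(p, 3))  # (1+1)/(1+5) = 0.333
```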
<p>Inspired by advances in NMT, we applied end-to-end neural network models, in
particular sequence-to-sequence (Seq2Seq) models, to the normalization task.
Seq2Seq models can convert sequences from one domain (e.g. sentences in
non-canonical form) to sequences in another domain (e.g. the same sentences in
canonical form). Their ability to capture useful contextual information in a sentence
can be exploited in the text normalization task. This eliminates the need for
language-specific tools, provided there is sufficient training data.</p>
      <p>
        We built our Seq2Seq model using the Keras library [
        <xref ref-type="bibr" rid="ref32">32</xref>
]. First, the combination
of the train and test datasets was used to determine the maximum sequence length and
the vocabulary of the problem. We map words to integers, as required for modeling.
Separate tokenizers were used for the source and the target sequences. Each input and
output sequence must be encoded as integers and padded to the maximum phrase
length, since word embeddings were used for input sequences and one-hot encoding
for output sequences.
      </p>
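The integer encoding and padding step can be sketched in pure Python (mirroring what Keras's `Tokenizer` and `pad_sequences` utilities do; index 0 is reserved for padding):

```python
def build_vocab(sentences):
    """Map each word to an integer id; 0 is reserved for padding."""
    vocab = {}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_and_pad(sentences, vocab, max_len):
    """Encode words as ids and pad each sequence to max_len."""
    out = []
    for sent in sentences:
        ids = [vocab[w] for w in sent.split()][:max_len]
        out.append(ids + [0] * (max_len - len(ids)))
    return out

sents = ["бұл керемет", "бұл керемет жаңалық"]
vocab = build_vocab(sents)
print(encode_and_pad(sents, vocab, 4))  # [[1, 2, 0, 0], [1, 2, 3, 0]]
```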
<p>We use an encoder-decoder Long Short-Term Memory (LSTM) network
for this problem. In this architecture, with 2-layer LSTM encoders and decoders, the
input sequence is encoded by a front-end model called the encoder and then decoded
word by word by a back-end model called the decoder. The model is trained using the
efficient Adam variant of stochastic gradient descent, minimizing the categorical
cross-entropy loss function, since we have framed the prediction problem as
multi-class classification.</p>
      <p>
To assess the quality of translation, we used a widely adopted metric – BLEU
(Bilingual Evaluation Understudy) [
        <xref ref-type="bibr" rid="ref33">33</xref>
]. The main idea behind this metric is to
measure the n-gram overlap between the translation candidate and the reference. After
translation, we compared our translated test set with the original test set. The results
for both the SMT and NMT models can be viewed in Table 4.
      </p>
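A simplified sketch of the idea behind BLEU (modified n-gram precision up to bigrams with a brevity penalty; real evaluations use a standard implementation such as NLTK's `corpus_bleu`):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Geometric mean of modified n-gram precisions, with brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)           # avoid log(0)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "бұл керемет жаңалық".split()
cand = "бұл керемет жаңалық".split()
print(round(bleu(cand, ref), 2))  # 1.0 for an exact match
```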
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
<p>In this work, we used machine translation approaches to normalize comments. To
create the machine translation systems, we first collected comments from news portals
and social networks. We then corrected these comments with annotators. A total of
27005 comments were collected and corrected. The original raw comments are treated
as the source side, and the revised comments as the target side of our
parallel corpus. Using these comments, we created a phrase-based statistical
machine translation system as a baseline. Furthermore, we applied word-based
sequence-to-sequence models to the normalization task in order to compare
statistical and neural network approaches. The statistical method achieves a BLEU
score of 21.67 on the test set, whereas the sequence-to-sequence model obtains
approximately 30; the latter technique improves the performance of the normalization
task significantly. Overall, however, both results can be viewed as only moderate
performance, which may be related to the sparseness of the dataset. To address this,
we plan to add more comments to our parallel dataset in the future. Moreover, we
will conduct experiments with sequence-to-sequence models with an attention
mechanism as well as character-based models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work has been funded by the Ministry of Education and Science of the Republic
of Kazakhstan under the research grants No. AP05134272 and No. AP08053085.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yessenbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makazhanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>KazNLP: A Pipeline for Automated Processing of Texts Written in Kazakh Language</article-title>
          . In: Karpov A.,
          <string-name>
            <surname>Potapova</surname>
            <given-names>R</given-names>
          </string-name>
          . (Eds.):
          <source>SPECOM</source>
          <year>2020</year>
          , LNAI,
          <volume>12335</volume>
          ,
          <fpage>657</fpage>
          -
          <lpage>666</lpage>
          . Springer Nature Switzerland AG (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Makazhanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yessenbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>KazNLP: NLP Tools for Kazakh Language</article-title>
          . https://github.com/nlacslab/kaznlp, last accessed
          <year>2020</year>
          /10/07.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yessenbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makazhanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Document and word-level language identification for noisy user generated text</article-title>
          .
          <source>In: 12th IEEE International Conference on Application of Information and Communication Technologies (AICT2014)</source>
          ,
          <fpage>124</fpage>
          -
          <lpage>127</lpage>
          .
Almaty
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Makazhanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myrzakhmetov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>On Various Approaches to Machine Translation from Russian to Kazakh</article-title>
          .
          <source>In: 5th International Conference on Turkic Languages Processing (TurkLang</source>
          <year>2017</year>
          ),
          <fpage>195</fpage>
          -
          <lpage>209</lpage>
          .
Kazan
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zulkhazhav</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yessenbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharipbay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Kazakh text summarization using fuzzy logic</article-title>
          .
          <source>Computacion y Sistemas</source>
          ,
          <volume>23</volume>
          (
          <issue>3</issue>
          ),
          <fpage>851</fpage>
          -
          <lpage>859</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Myrzakhmetov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Extended language modeling experiments for Kazakh</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <volume>2303</volume>
          ,
          <fpage>42</fpage>
          -
          <lpage>52</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yessenbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karabalayeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Kazakh and Russian languages identification using long short-term memory recurrent neural networks</article-title>
          .
          <source>In: 11th IEEE International Conference on Application of Information and Communication Technologies</source>
          ,
          <fpage>342</fpage>
          -
          <lpage>347</lpage>
          . Moscow (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kozhirbayev</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erol</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharipbay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jamshidi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Speaker recognition for robotic control via an IoT device</article-title>
          . In: World Automation Congress,
          <fpage>259</fpage>
          -
          <lpage>264</lpage>
          . Stevenson, Washington (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Church</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          :
          <article-title>Probability scoring for spelling correction</article-title>
          .
          <source>Statistics and Computing</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <fpage>93</fpage>
          -
          <lpage>103</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mays</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damerau</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercer</surname>
            ,
            <given-names>R. L.</given-names>
          </string-name>
          :
          <article-title>Context based spelling correction</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>27</volume>
          (
          <issue>5</issue>
          ),
          <fpage>517</fpage>
          -
          <lpage>522</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Brill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          :
          <article-title>An improved error model for noisy channel spelling correction</article-title>
          .
          <source>In: Proceedings of the 38th annual meeting of the association for computational linguistics</source>
          ,
<fpage>286</fpage>
-
<lpage>293</lpage>
          . Hong Kong (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>A. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostendorf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richards</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Normalization of non-standard words</article-title>
          .
          <source>Computer speech &amp; language</source>
          ,
          <volume>15</volume>
          (
          <issue>3</issue>
          ),
          <fpage>287</fpage>
          -
          <lpage>333</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          :
          <article-title>Pronunciation modeling for improved spelling correction</article-title>
          .
          <source>In: 40th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <fpage>144</fpage>
          -
          <lpage>151</lpage>
          . Philadelphia (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kukich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Techniques for automatically correcting words in text</article-title>
          .
          <source>Acm Computing Surveys (CSUR)</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <fpage>377</fpage>
          -
          <lpage>439</lpage>
          (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Whitelaw</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Using the web for language independent spellchecking and autocorrection</article-title>
          .
          <source>In: 2009 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <fpage>890</fpage>
          -
          <lpage>899</lpage>
          .
Singapore
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Beaufort</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roekhaut</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cougnon</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fairon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A hybrid rule/model-based finitestate framework for normalizing SMS messages</article-title>
          .
          <source>In: 48th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>770</fpage>
          -
          <lpage>779</lpage>
          .
Uppsala
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saraf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukherjee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Investigation and modeling of the structure of texting language</article-title>
          .
          <source>International Journal of Document Analysis and Recognition (IJDAR)</source>
          ,
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>157</fpage>
          -
          <lpage>174</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Pennell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Normalization of text messages for text-to-speech</article-title>
          .
          <source>In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          <fpage>4842</fpage>
          -
          <lpage>4845</lpage>
          . Dallas (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Lexical normalisation of short text messages: Makn sens a #twitter</article-title>
          .
          <source>In: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <fpage>368</fpage>
          -
          <lpage>378</lpage>
          . Portland, Oregon
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision</article-title>
          .
          <source>In: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <fpage>71</fpage>
          -
          <lpage>76</lpage>
          . Portland, Oregon
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Gouws</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Contextual bearing on linguistic variation in social media</article-title>
          .
          <source>In: Proceedings of the workshop on language in social media (LSM)</source>
          ,
          <fpage>20</fpage>
          -
          <lpage>29</lpage>
          . Portland, Oregon
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          ,
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          . Montreal (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Merriënboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norouzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macherey</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krikun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macherey</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          .
          <source>arXiv:1609.08144</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A neural conversational model</article-title>
          .
          <source>arXiv:1506.05869</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Nallapati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Abstractive text summarization using sequence-to-sequence rnns and beyond</article-title>
          .
          <source>arXiv:1602.06023</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Neural generative question answering</article-title>
          .
          <source>arXiv:1512.01337</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Ikeda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Japanese text normalization with encoder-decoder model</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</source>
          ,
          <fpage>129</fpage>
          -
          <lpage>137</lpage>
          .
          Osaka
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Lourentzou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manghnani</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Adapting sequence to sequence models for text normalization in social media</article-title>
          .
          <source>In: International AAAI Conference on Web and Social Media</source>
          ,
          <fpage>335</fpage>
          -
          <lpage>345</lpage>
          .
          Munich
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Lusetti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruzsics</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Göhring</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samardžić</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stark</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Encoder-decoder methods for text normalization</article-title>
          .
          <source>In: Fifth Workshop on NLP for Similar Languages, Varieties and Dialects. Santa Fe</source>
          , New Mexico (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Çolakoğlu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sulubacak</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tantuğ</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          :
          <article-title>Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches</article-title>
          .
          <source>In: 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop</source>
          ,
          <fpage>267</fpage>
          -
          <lpage>272</lpage>
          .
          Florence
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep learning mit Python und Keras: Das Praxis-Handbuch vom Entwickler der Keras-Bibliothek.</article-title>
          <publisher-name>MITP-Verlags GmbH &amp; Co. KG</publisher-name>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          :
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In: 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . Philadelphia (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>