<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Conference on Applied Informatics, Eger, Hungary, January</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automatic Diacritic Restoration With Transformer Model Based Neural Machine Translation for East-Central European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>László János Laki</string-name>
          <email>laki.laszlo@itk.ppke.hu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zijian Győző Yang</string-name>
          <email>yang.zijian.gyozo@itk.ppke.hu</email>
          <email>yang.zijian.gyozo@uni-eszterhazy.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eszterházy Károly University, Faculty of Informatics</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MTA-PPKE Hungarian Language Technology Research Group</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pázmány Péter Catholic University, Faculty of Information Technology and Bionics</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>2</volume>
      <fpage>9</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In the last few years the amount of text written on mobile devices has suddenly increased. People often type messages without diacritic marks, and therefore more and more corpora generated online contain unaccented text. This causes difficulties in natural language processing (NLP) tasks. An accent restoration application can clean such corpora and prepare them as training data for higher-level NLP tools. In our study, we created a diacritic restoration method based on state-of-the-art neural machine translation (NMT) techniques (the transformer model and SentencePiece tokenization [SPM]). Our system was tested on 14 languages from the East-Central European region. Most of our systems perform above 98%. We made a deeper analysis of the Hungarian system, where we reached 99.8% relative accuracy, which is the state-of-the-art result among Hungarian accent restoration techniques. Furthermore, we created multilingual models as well, where a single restoration engine is able to handle all of the languages. This system has performance comparable to the monolingual ones, despite the fact that it has much smaller resource requirements. Finally, an online demo was created to present our application.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Nowadays a huge amount of written text is available on the Internet. Computational
linguists have a great opportunity to collect and use these data in their studies in
many areas, such as machine translation, text extraction, or sentiment analysis.
However, the highest possible quality of these texts is essential for solving the above
tasks efficiently.</p>
      <p>Writing without diacritic symbols has become a mass phenomenon in the
comment sections of social media. These texts are really important data
for machine learning algorithms, but the commonly used natural language processing
models do not work well on data without accents or with incorrect spelling.
With an accent restoration program, we are able to restore the misspelled words, which
improves the quality of the text processing algorithms.</p>
      <p>In recent years, the results of neural network-based methods have outperformed
the previous best systems. This is also evident in the field of language
technology, so our aim was to investigate the problem of accent and diacritical character
restoration with the current state-of-the-art NMT-based system.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related works</title>
      <p>
        In recent years several attempts have been made to restore accents.
Mihalcea and Nastase [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] made experiments with language-independent machine
learning methods. In one of them, the position and the environment of the
accented letter are taken into consideration; the accuracy of this approach is 95%.
In another method, the distribution of different accented words was estimated
from the corpus, and 98% accuracy was reached. However, the disadvantage of
this system is that it cannot handle unknown words which are not
represented in the corpus.
      </p>
      <p>
        Charlifter also pursued a language-independent solution [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ], in which
lexicon-based statistical methods were used to restore the accents. It monitors the
immediate environment and applies a character-based statistical model to handle
unknown words. The accuracy of that system was only 93%. Language-dependent
methods were investigated for Spanish and French by Yarowsky [
        <xref ref-type="bibr" rid="ref16">17</xref>
        ] as well as for
French by Zweigenbaum and Grabar [
        <xref ref-type="bibr" rid="ref17">18</xref>
        ].
      </p>
      <p>
        For Hungarian, Németh et al. [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ] presented a text-to-speech application in
which they handle words without accents. Morphological and syntactic
analyzers were used to solve the problem, achieving 95% accuracy. Ács
and Halmi [19] created an n-gram based statistical system which does not use
any language-dependent dictionaries. They reported 98.36% accuracy for
Hungarian texts. Náplava et al. [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ] published a bi-RNN based system and
measured its performance for multiple languages. They reached 99.29% accuracy
for Hungarian diacritic restoration.
      </p>
      <p>
        There are some solutions in which machine translation techniques are used
to solve the task of accent restoration. Novák and Siklósi [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ] restored accents
with statistical machine translation (SMT) methods. Experiments were executed
with and without a morphological analyzer in their SMT systems. Their best
result - 99.06% accuracy - was achieved with the morphological analyzer. In his BSc
thesis, Nagy [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] used an RNN-based neural machine translation system to solve
the problem. His best result achieved 99.5% accuracy. In his work, he performed
BPE (byte pair encoding) tokenization [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ] using separate vocabularies for
the source- and target-language sides of the training sets. This study is considered to
be the most similar to our work.
      </p>
      <p>In our research, we use the current state-of-the-art transformer model instead
of the RNN model, as well as SentencePiece (SPM) tokenization with a shared
vocabulary instead of BPE. With these techniques we are able to increase the accuracy
even further.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Restoration of diacritic words</title>
      <p>The essence of a corpus-based machine translation system is that it performs a
transformation between source- and target-language sentences with the help of
parallel corpora. It is an obvious choice to use machine translation techniques to
restore diacritic words, since the sentences with and without accents are almost
grammatically identical: they have monotone word order and similar vocabularies.</p>
      <p>Training of the neural network requires a huge amount of training data, which
is very easy to produce for the present task. The training sets were created by
removing the diacritic symbols from a monolingual text.</p>
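      <p>Generating such a source side can be done with Unicode normalization. A minimal Python sketch (not the authors' actual script, just one standard way to strip diacritics):</p>
      <preformat><![CDATA[
```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining accent marks, keeping the base letters."""
    # NFD splits accented letters into base letter + combining mark(s);
    # dropping the marks and recomposing yields the unaccented text.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Caveat: letters without a canonical decomposition (e.g. the Polish
# l-with-stroke) are not affected by NFD and would need an explicit mapping.
```
]]></preformat>
      <p>Pairing each original sentence with its stripped counterpart yields the parallel training corpus.</p>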
      <sec id="sec-4-1">
        <title>3.1. Neural machine translation system</title>
        <p>
          Statistical machine translation systems had reached their limits by the first half
of the 2010s: the development of their base methods and frameworks stalled
despite the large amount of work invested by researchers. The breakthrough was
brought by the system of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], an attention-supported encoder-decoder
architecture based NMT. The essence of the model is the separation of the
translation process into two parts. The first part is encoding, where an RNN-based seq2seq
model is created. This model - similarly to a word embedding model - creates an
n-dimensional vector from the words of the source side. This vector corresponds
to the red/dark node in the middle of Figure 1. The second phase is
decoding, where the system generates the target-language sentence from the previously
created sentence vector using an RNN layer.
        </p>
        <p>
          From this point on, NMT systems took over the dominance from SMT. In
2017 a multi-attention NMT system - referred to as the transformer-based architecture -
was published and made accessible by Google LLC [
          <xref ref-type="bibr" rid="ref15">16</xref>
          ]. The attention model is a
hidden layer between source and target words, and its role is to support the decoder
during the generation of the target words. The essence of the transformer method is
that multiple attention layers are placed into the NMT architecture instead of one;
consequently, the translation quality of ambiguous words greatly improved.
        </p>
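        <p>The attention layer described above can be sketched as scaled dot-product attention, the building block the transformer stacks several times. A minimal, illustrative Python sketch (plain lists instead of tensors; names and shapes are our own, not the paper's):</p>
        <preformat><![CDATA[
```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """For each query, a softmax(q.k / sqrt(d_k))-weighted sum of the values."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)  # how much each source position matters for q
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```
]]></preformat>
        <p>Each decoder position thus looks at every encoder state, weighted by relevance, instead of relying on a single fixed sentence vector.</p>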
        <p>
          In our study we used the Marian NMT framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which is an open-source
package written in C++. It was chosen because it is a well-documented,
memory- and resource-optimized implementation (https://marian-nmt.github.io/), and it is easy to use. For
these reasons Marian NMT is the most commonly used framework among academic
users as well as commercial developers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Sentence Piece Tokenizer</title>
        <p>Since NMT systems run on GPUs, one of their limitations is the size of
the GPU memory. This factor defines the size of the vocabulary that can be used
by the NMT system. A word-based system is usually limited to about 100K distinct
words, and the remaining words are handled as unknown.</p>
        <p>
          The problem was solved by reducing the smallest translation unit from the word to
the subword (word fragment) [
          <xref ref-type="bibr" rid="ref14">15</xref>
          ]. BPE (byte pair encoding) is a data compression
procedure in which the most common byte pairs are replaced by a byte that is not
included in the data itself. The procedure first creates a character-based dictionary
from the corpus, where all words are represented as character sequences. Secondly, it
merges frequent character sequences into stand-alone tokens based on their frequency. This process
not only compresses the data but also solves the handling of unknown words, since
words which were originally not included in the corpus can also be composed
from subwords.
        </p>
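        <p>The merge procedure above can be sketched in a few lines. A toy, illustrative implementation of learning BPE merges (the "&lt;/w&gt;" end-of-word marker and the data structures are our own choices, following the original BPE description, not the paper's code):</p>
        <preformat><![CDATA[
```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations from a list of words."""
    # Start from a character-level representation; "</w>" marks word ends.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged token.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```
]]></preformat>
        <p>Applying the learned merges to new text segments any word, seen or unseen, into known subword tokens.</p>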
        <p>
          This method was further developed by Kudo and Richardson [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ]. They
created an unsupervised text tokenizer and detokenizer, called SentencePiece, primarily
for neural network-based machine learning tasks. It implements BPE as well as a
subword segmentation weighted by a unigram language model [6]. By using this system,
language-specific preprocessing steps - such as tokenization or lowercasing - are not
needed. The essence of this method is the limitation of the number of different "words" and
the elimination of unknown words in the training set. In this way the number of
parameters in the neural network can be significantly reduced.
        </p>
        <p>(1) Plain text: Petőfi Sándor egy nagyszerű költő.</p>
        <p>SPM text: ▁P ető fi ▁S ándor ▁egy ▁nagyszerű ▁költő .</p>
        <p>Example (1) shows the output of the SPM model. The words of the plain
text are broken into frequently occurring character sequences. It is interesting to
note that the spaces of the original sentence are also attached to the words and
treated as separate characters (▁).</p>
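        <p>Because of this marker convention, SPM tokenization is lossless: the original sentence is recovered by concatenating the pieces and turning the markers back into spaces. A small sketch (assuming ▁, U+2581, is the marker, as SentencePiece uses by default):</p>
        <preformat><![CDATA[
```python
META = "\u2581"  # the "▁" marker SentencePiece substitutes for spaces

def detokenize(pieces):
    """Lossless inverse of SPM segmentation: join pieces, restore spaces."""
    return "".join(pieces).replace(META, " ").strip()

# the segmentation of example (1)
pieces = ["▁P", "ető", "fi", "▁S", "ándor", "▁egy",
          "▁nagyszerű", "▁költő", "."]
```
]]></preformat>
        <p>No language-specific detokenizer is needed, which is what makes SPM convenient as the final postprocessing step of the restoration pipeline.</p>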
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Demo interface</title>
        <p>We have created a demo interface (http://nlpg.itk.ppke.hu/projects/accent, see Figure 2) to demonstrate the different models.
Using a drop-down menu the actual model can be selected, and there is an input
field where words can be typed. The demo examines the text typed before
spaces, and if it is found to be incorrect, a suggestion is made to correct it
dynamically.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>The theoretical base of our work were the newly available transformer and SPM
technologies. Our goal was to improve the quality of diacritic restoration with these
new methods.</p>
      <sec id="sec-5-1">
        <title>4.1. Corpora</title>
        <p>In our work the online available parallel corpus called OpenSubtitles
(http://opus.nlpl.eu/OpenSubtitles-v2018.php) was
used. This corpus contains texts written in 62 languages; it consists of film subtitles
and thus mainly includes shorter, informal sentences. We tested our system on 14
East-Central European languages which use the Latin alphabet. The whole list of the
selected languages, their sentence and word statistics, and diacritic symbols are
shown in Table 1. We can see that some languages are quite low-resourced, while
most of them have more than 20 million segments. It is also interesting to examine
the ratio of diacritic symbols among the words and characters.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. NMT system</title>
        <p>Marian NMT was applied to train and decode the NMT models, with setup values
based on its default parameter settings. In contrast to the related publications, the same
SPM vocabulary was used on both sides, which forces the system to tokenize source and target
words in the same way. This led to 100% character coverage on the test corpora.</p>
        <p>The following parameters were used to train our NMT-TM model:
• transformer model: size of vocabulary: 16000; number of encoder-decoder layers: 6; transformer-dropout: 0.1;
• learning rate: 0.0003; lr-warmup: 16000; lr-decay-inv-sqrt: 16000;
• optimizer hyperparameters: 0.9 0.98 1e-09; beam-size: 6; normalize: 0.6;
• label-smoothing: 0.1; exponential-smoothing</p>
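        <p>These settings correspond closely to Marian's command-line options. A hedged sketch of how such a training run might be invoked (file paths are placeholders, and the flag set follows Marian's documented options rather than the authors' actual scripts; beam-size and normalize apply at decoding/validation time):</p>
        <preformat><![CDATA[
```shell
# Illustrative Marian training invocation (paths are placeholders):
# source side = diacritic-stripped text, target side = accented text,
# with one shared SentencePiece vocabulary for both sides.
marian \
  --type transformer \
  --train-sets corpus.noacc corpus.acc \
  --vocabs spm.spm spm.spm \
  --dim-vocabs 16000 16000 \
  --enc-depth 6 --dec-depth 6 \
  --transformer-dropout 0.1 \
  --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 \
  --optimizer-params 0.9 0.98 1e-09 \
  --label-smoothing 0.1 --exponential-smoothing \
  --beam-size 6 --normalize 0.6 \
  --model model.npz
```
]]></preformat>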
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Trained models</title>
        <p>First of all, we created a transformer NMT based diacritic restoration system for each of
the 14 languages separately. To create the training data, all diacritic symbols were
removed from the words as a first step. Hereinafter this architecture will be referred to
as NMT-TM. To train the systems, we separated 5,000 segments for the validation
set and 3,000 segments for the test set from each of the training corpora. With the validation
set the neural network parameters were optimized during training. The results
calculated on the separated data will be referred to as the monolingual models.</p>
        <p>Secondly, two multilingual models were trained. We randomly selected 1
million segments from each language resource and mixed them into one training corpus.
This model is called multi1M. The advantages of this model are the limited training time
and the much lower hardware requirements; on the other hand, it has much
less training data for any specific language. Furthermore, similar languages could
degrade each other's restoration quality, although in the case of the low-resourced languages this
technique could help. To increase the quality of the multi1M system we trained a
similar system in which the language code was inserted as the first token at the
beginning of every segment; as a result, the quality drop could be eliminated. This
system is called multi1M+lang.</p>
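        <p>The language-code insertion is a simple preprocessing step. A sketch (the exact token format, here "&lt;2xx&gt;", is our illustrative choice; any reserved token the tokenizer keeps intact would serve):</p>
        <preformat><![CDATA[
```python
def add_lang_token(segments, lang_code):
    """Prepend a language-identifier token to each training segment."""
    # "<2xx>" is a hypothetical token format; it only matters that the same
    # token is used consistently at training and at inference time.
    token = f"<2{lang_code}>"
    return [f"{token} {seg}" for seg in segments]
```
]]></preformat>
        <p>At inference time the same token tells the shared model which language's accentuation rules to apply.</p>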
        <p>
          Thirdly, we compared our method with the previous state-of-the-art machine
translation based solutions for the Hungarian language [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ]. For quality
comparison, reference systems were trained: the SMT system without the morphological analyzer described
by Novák [
          <xref ref-type="bibr" rid="ref12">13</xref>
          ] (SMT) and an RNN-based neural machine translation system
(NMT-RNN) [
          <xref ref-type="bibr" rid="ref9">10</xref>
          ].
        </p>
        <p>
          To train the SMT system, the Moses framework [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] was used, and the language
model was created with KenLM [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The default settings of the
system were used during training, and the word-binding and reordering steps were left out.
These steps are not needed in the case of accent restoration, as the word count and
word order are the same on both the source and the target side. The text preprocessing
phase consists of tokenization and a "truecase" step; truecasing is the process which
decides whether the initial word of a sentence is basically used in lowercase or
uppercase. Accordingly, detruecasing and detokenization steps are performed during
postprocessing. In the case of NMT-RNN, Marian NMT was used with the s2s model
type, trained on BPE-tokenized data.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Evaluation</title>
      <p>Precision, recall, and absolute accuracy of the word-based results were measured
in our research. Since originally correct words may change during machine
translation, it is necessary to check the accuracy of the translation for all words
(ALL). Furthermore, these metrics were also calculated only for the translation of
words which could contain diacritic characters.</p>
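      <p>A word-level scorer of this kind can be sketched as follows. The precise true/false-positive definitions are not spelled out in the text, so this sketch uses one plausible convention: a word counts as accented if it differs from its diacritic-stripped form.</p>
      <preformat><![CDATA[
```python
import unicodedata

def _strip(word):
    d = unicodedata.normalize("NFD", word)
    return unicodedata.normalize(
        "NFC", "".join(c for c in d if not unicodedata.combining(c)))

def evaluate(ref_words, hyp_words):
    """Word accuracy over all words (ALL), plus P/R/F1 over accented words."""
    assert len(ref_words) == len(hyp_words)
    pairs = list(zip(ref_words, hyp_words))
    acc = sum(r == h for r, h in pairs) / len(pairs)
    tp = sum(1 for r, h in pairs if h != _strip(h) and h == r)  # correct accents
    fp = sum(1 for r, h in pairs if h != _strip(h) and h != r)  # wrong accents
    fn = sum(1 for r, h in pairs if r != _strip(r) and h != r)  # missed accents
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, precision, recall, f1
```
]]></preformat>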
      <p>In Table 1 the quality of the restoration is shown for all monolingual
models, for which 14 different translation systems were trained on monolingual data.
We can see that all of our systems reached an accuracy higher than 94%, and half of
them are above 99%. Unfortunately, we were not able to check the reference data
of every system manually, since we do not have language specialists for most of the
languages. [Flattened table residue removed here: F1 and accuracy columns of the
monolingual models and of the multi1M model; the language labels of the rows
were lost in extraction.] Despite the huge amount of Romanian data, it reached
the lowest quality. During the human evaluation of the training data we found
that more than 20% of the data lacked diacritic symbols, which could explain
this result.</p>
      <p>The other two columns list the quality of the multilingual systems.
For comparability, the shown results were calculated on the same test
sets as the monolingual ones; none of the test sentences were part of the training or
the validation data. We can see that inserting the language code at the beginning
of the segments increases the quality of the multi1M system significantly, and all
of the results are comparable with the monolingual ones. It is really interesting to see
that the multi1M+lang system outperforms the less effective monolingual
models, which means that it could use some information from the other languages. It is
important to note that the multi1M system is trained on a corpus which includes
only 1 million segments from each language. In the last two rows the multi1M systems
were evaluated on multilingual test sets, which were separated from their own
training data.</p>
      <p>Finally, we compared our system with the previous state-of-the-art machine
translation based solutions tested on Hungarian. During our first
measurement we found several misspelled words in our reference
sets, so we corrected the Hungarian one manually. The results
with this modified test set are shown in Table 3. From this table we can see that
our system significantly outperforms the previous MT based systems from every
point of view. We found that, in contrast to the SMT and NMT-RNN systems, NMT-TM did
not replace correct words (without accents) with incorrectly spelled ones.
[Flattened error-type table residue: equivalent alternative forms (hova - hová
(where), tied - tiéd (yours)); context-dependent forms (ref: Érdekelné ez a dolog?
res: Érdekelne ez a dolog?; ref: Különben nem hoznák haza. res: Különben nem
hoznak haza.); foreign proper names (Liúról - Liuról, Ramával - Rámával);
ambiguous accented pairs (még - meg, melyen - mélyen, teli - téli).]</p>
      <p>During the deeper analysis we found that, of the 18,438 tokens (8,957 types),
only 69 words differ from the reference. When the results were manually evaluated,
we found that more than half of the errors (38 items) had an equivalent meaning or
a correct alternative form (e.g. hova - hová (where); tied - tiéd (yours)). The
remaining 31 words were indeed incorrectly restored: 14 were foreign proper names and
17 had an ambiguous meaning with and without the accent. Most of these cases would
need further context for disambiguation, as in example (2).</p>
      <p>(2) REF: Különben nem hoznák haza. (Otherwise, they may not bring her home.)</p>
      <p>RES: Különben nem hoznak haza. (Otherwise, they will not bring me home.)</p>
      <p>The other advantage of the multilingual models is the significant reduction in
training time. On average, one monolingual model took 36 hours to train, which
for the 14 languages adds up to about 504 hours. In comparison, the multilingual
model was trained in 33 hours.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>In this research a diacritic restoration system was created based on
state-of-the-art neural machine translation techniques. First of all, the system was trained
on 14 different East-Central European languages. In most cases our system achieves
an accuracy over 99%. Secondly, two multilingual models were created, which reduce the
hardware requirements and training time of the system while maintaining comparable
performance. Finally, our method was compared with the existing state-of-the-art
Hungarian accent restoration systems. Our system reaches 99.83% relative
accuracy, which significantly outperforms them. We created a demo site where the
system can be tried. In the future, we would like to extend our multilingual models
with the remaining Latin-based languages, such as German, French, etc.</p>
      <p>Acknowledgements. This research was implemented with support provided by
the Artificial Intelligence National Excellence Program (grant no.:
2018-1.2.1-NKP-2018-00008).</p>
      <p>[6] Kudo, T. Subword regularization: Improving neural network translation models
with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers) (Melbourne,
Australia, July 2018), Association for Computational Linguistics, pp. 66–75.</p>
      <p>[19] Ács, J., and Halmi, J. Hunaccent: Small footprint diacritic restoration for social
media. In Proceedings of the Tenth International Conference on Language Resources
and Evaluation (LREC 2016) (Portoroz, Slovenia, 2016), pp. 3526–3529.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>In 3rd International Conference on Learning Representations, ICLR 2015</source>
          , San Diego, CA, USA, May 7-9,
          <year>2015</year>
          , Conference Track Proceedings,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and Y. LeCun, Eds.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Barrault</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Bojar</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Costa-jussà</surname>, <given-names>M. R.</given-names></string-name>,
          <string-name><surname>Federmann</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Fishel</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Graham</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Haddow</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Huck</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Koehn</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Malmasi</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Monz</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Müller</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Pal</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Post</surname>, <given-names>M.</given-names></string-name>, and
          <string-name><surname>Zampieri</surname>, <given-names>M.</given-names></string-name>
          <article-title>Findings of the 2019 conference on machine translation (WMT19)</article-title>
          .
          <source>In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)</source>
          (Florence, Italy,
          <year>August 2019</year>
          ), Association for Computational Linguistics, pp.
          <fpage>1</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Heafield</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>KenLM: faster and smaller language model queries</article-title>
          .
          <source>In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation</source>
          (Edinburgh, Scotland, United Kingdom,
          <year>July 2011</year>
          ), pp.
          <fpage>187</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Junczys-Dowmunt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grundkiewicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwojak</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heafield</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neckermann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seide</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Germann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aji</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogoychev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martins</surname>
            ,
            <given-names>A. F. T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Marian: Fast neural machine translation in C++</article-title>
          .
          <source>In Proceedings of ACL 2018, System Demonstrations</source>
          (Melbourne, Australia,
          <year>July 2018</year>
          ), Association for Computational Linguistics, pp.
          <fpage>116</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Federico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertoldi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moran</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Herbst</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Moses: Open source toolkit for statistical machine translation</article-title>
          .
          <source>In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions</source>
          (Stroudsburg, PA, USA,
          <year>2007</year>
          ), ACL '07, Association for Computational Linguistics, pp.
          <fpage>177</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kudo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          (Brussels, Belgium, Nov.
          <year>2018</year>
          ),
          Association for Computational Linguistics
          , pp.
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Laki</surname>
            ,
            <given-names>L. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z. G.</given-names>
          </string-name>
          <article-title>Automatikus ékezetvisszaállítás transzformer modellen alapuló neurális gépi fordítással</article-title>
          .
          <source>XVI. Magyar Számítógépes Nyelvészeti Konferencia</source>
          (
          <year>2020</year>
          ),
          <fpage>181</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Letter level learning for language independent diacritics restoration</article-title>
          .
          <source>In Proceedings of the 6th Conference on Natural Language Learning - Volume 20</source>
          (Stroudsburg, PA, USA,
          <year>2002</year>
          ), COLING-02, Association for Computational Linguistics, pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nagy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Magyar nyelvű zajos szövegek automatikus normalizálása</article-title>
          .
          <source>Master's thesis</source>
          ,
          <source>Pázmány Péter Katolikus Egyetem</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Náplava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straňák</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Diacritics restoration using neural networks</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          (Miyazaki, Japan, May
          <year>2018</year>
          ), European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Németh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zainkó</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fekete</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olaszy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Endrédi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olaszi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiss</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>The design, implementation, and operation of a Hungarian e-mail reader</article-title>
          .
          <source>International Journal of Speech Technology</source>
          <volume>3</volume>
          ,
          <issue>3</issue>
          (Dec
          <year>2000</year>
          ),
          <fpage>217</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Novák</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Siklósi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Automatic diacritics restoration for Hungarian</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          (Lisbon, Portugal, Sept.
          <year>2015</year>
          ),
          Association for Computational Linguistics
          , pp.
          <fpage>2286</fpage>
          -
          <lpage>2291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Scannell</surname>
            ,
            <given-names>K. P.</given-names>
          </string-name>
          <article-title>Statistical unicodification of African languages</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          ,
          <issue>3</issue>
          (Jun
          <year>2011</year>
          ),
          <fpage>375</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Neural machine translation of rare words with subword units</article-title>
          .
          <source>CoRR abs/1508.07909</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>Ł.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 30</source>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          , Eds. Curran Associates, Inc.,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Yarowsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>A Comparison of Corpus-Based Techniques for Restoring Accents in Spanish and French Text</article-title>
          . Springer Netherlands, Dordrecht,
          <year>1999</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Grabar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>Accenting unknown words in a specialized language</article-title>
          .
          <source>In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain</source>
          (Philadelphia, Pennsylvania, USA,
          <year>July 2002</year>
          ),
          Association for Computational Linguistics
          , pp.
          <fpage>21</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>