<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Conference on Applied Informatics, Eger, Hungary, January</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automatic Diacritic Restoration With Transformer Model Based Neural Machine Translation for East-Central European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>László János Laki</string-name>
          <email>laki.laszlo@itk.ppke.hu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zijian Győző Yang</string-name>
          <email>yang.zijian.gyozo@itk.ppke.hu</email>
          <email>yang.zijian.gyozo@uni-eszterhazy.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eszterházy Károly University, Faculty of Informatics</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MTA-PPKE Hungarian Language Technology Research Group</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pázmány Péter Catholic University, Faculty of Information Technology and Bionics</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>2</volume>
      <fpage>9</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In the last few years the amount of text written on mobile devices has suddenly increased. People often type messages without diacritic marks, and therefore more and more corpora generated online contain unaccented text. This causes difficulties in natural language processing (NLP) tasks. An accent restoration application can clean such corpora and prepare them as training data for higher-level NLP tools. In our study, we created a diacritic restoration method based on state-of-the-art neural machine translation (NMT) techniques (the transformer model and SentencePiece tokenization [SPM]). Our system was tested on 14 languages from the East-Central European region. Most of our systems perform above 98%. We made a deeper analysis of the Hungarian system, where we reached 99.8% relative accuracy, which is the state-of-the-art result among Hungarian accent restoration techniques. Furthermore, we created multilingual models as well, where a single restoration engine is able to handle all of the languages. This system has performance comparable to the monolingual ones, despite the fact that it has much smaller resource requirements. Finally, an online demo was created to present our application.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Nowadays a huge amount of written text is available on the Internet. Computational
linguists have a great opportunity to collect and use these data in their studies in
many areas, such as machine translation, text extraction, or sentiment analysis.
However, the highest possible quality of these texts is essential for solving the above
tasks efficiently.</p>
      <p>Writing without diacritic symbols has become a mass phenomenon in the
comment sections of social media. These texts are really important data
for machine learning algorithms, but the commonly used natural language processing
models do not work well on data without accents or with incorrect spelling.
With an accent restoration program, we are able to restore the misspelled words, which
improves the quality of the text processing algorithms.</p>
      <p>In recent years, the results of neural network-based methods have outperformed
the previous best systems. This is also evident in the field of language
technology, so our aim was to investigate the problem of accent and diacritical character
restoration with the current state-of-the-art NMT-based system.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related works</title>
      <p>
        In recent years several attempts have been made to restore accents.
Mihalcea and Nastase [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] made experiments with language-independent machine
learning methods. In one of them, the position and the environment of the
accented letter are taken into consideration; the accuracy of this approach is 95%.
In another method, the distribution of different accented words was estimated
from the corpus, and 98% accuracy was reached. However, the disadvantage of
this system is that it cannot handle unknown words which are not
represented in the corpus.
      </p>
      <p>
        Charlifter also pursued a language-independent solution [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ], in which
lexicon-based statistical methods were used to restore the accents. It monitors the
immediate environment and applies a character-based statistical model to handle
unknown words. The accuracy of that system was only 93%. Language-dependent
methods were investigated for Spanish and French by Yarowsky [
        <xref ref-type="bibr" rid="ref16">17</xref>
        ] as well as for
French by Zweigenbaum and Grabar [
        <xref ref-type="bibr" rid="ref17">18</xref>
        ].
      </p>
      <p>
        For Hungarian, Németh et al. [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ] presented a text-to-speech application in
which they handle words without accents. Morphological and syntactic
analyzers were used to solve the problem, achieving 95% accuracy. Ács
and Halmi [19] created an n-gram based statistical system which does not use
any language-dependent dictionaries. They reported 98.36% accuracy for
Hungarian texts. Náplava et al. [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ] published a bi-RNN based system and
measured its performance for multiple languages. They reached 99.29% accuracy
for Hungarian diacritic restoration.
      </p>
      <p>
        There are some solutions in which machine translation techniques are used
to solve the task of accent restoration. Novák and Siklósi [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ] restored accents
with statistical machine translation (SMT) methods. Experiments were executed
with and without a morphological analyzer in their SMT systems. Their best
result - 99.06% accuracy - was achieved with the morphological analyzer. In his BSc
thesis, Nagy [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] used an RNN-based neural machine translation system to solve
the problem. His best result achieved 99.5% accuracy. In his work, he performed
BPE (byte pair encoding) tokenization [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ] using separate vocabularies for
the source- and target-language sides of the training sets. This study is considered to
be the most similar to our work.
      </p>
      <p>In our research, we use the current state-of-the-art transformer model instead
of the RNN model, as well as SentencePiece (SPM) tokenization with a shared
vocabulary instead of BPE. With these techniques we are able to increase the accuracy
even further.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Restoration of diacritic words</title>
      <p>The essence of a corpus-based machine translation system is that it performs a
transformation between source- and target-language sentences with the help of
parallel corpora. It is an obvious choice to use machine translation techniques to
restore diacritic words, since the sentences with and without accents are almost
grammatically identical: they have monotone word order and similar vocabularies.</p>
      <p>Training of the neural network requires a huge amount of training data, which
is very easy to produce for the present task. The training sets were created by
removing the diacritic symbols from a monolingual text.</p>
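      <p>Generating such a source side can be done with Unicode normalization. A minimal Python sketch (not the authors' actual script, just one standard way to strip diacritics):</p>
      <preformat><![CDATA[
```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining accent marks, keeping the base letters."""
    # NFD splits accented letters into base letter + combining mark(s);
    # dropping the marks and recomposing yields the unaccented text.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Caveat: letters without a canonical decomposition (e.g. the Polish
# l-with-stroke) are not affected by NFD and would need an explicit mapping.
```
]]></preformat>
      <p>Pairing each original sentence with its stripped counterpart yields the parallel training corpus.</p>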
      <sec id="sec-4-1">
        <title>3.1. Neural machine translation system</title>
        <p>
          Statistical machine translation systems had reached their limits by the first half
of the 2010s: the development of their base methods and frameworks stalled
despite the large amount of work invested by researchers. The breakthrough was
brought by the system of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], an attention-supported encoder-decoder
architecture based NMT. The essence of the model is the separation of the
translation process into two parts. The first part is encoding, where an RNN-based seq2seq
model is created. This model - similarly to a word embedding model - creates an
n-dimensional vector from the words of the source side. This vector corresponds
to the red/dark node in the middle of Figure 1. The second phase is
decoding, where the system generates the target-language sentence from the previously
created sentence vector using an RNN layer.
        </p>
        <p>
          From this point on, NMT systems took over the dominance from SMT. In
2017 a multi-attention NMT system - referred to as the transformer-based architecture -
was published and made accessible by Google LLC [
          <xref ref-type="bibr" rid="ref15">16</xref>
          ]. The attention model is a
hidden layer between source and target words, and its role is to support the decoder
during the generation of the target words. The essence of the transformer method is
that multiple attention layers are placed into the NMT architecture instead of one;
consequently, the translation quality of ambiguous words greatly improved.
        </p>
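        <p>The attention layer described above can be sketched as scaled dot-product attention, the building block the transformer stacks several times. A minimal, illustrative Python sketch (plain lists instead of tensors; names and shapes are our own, not the paper's):</p>
        <preformat><![CDATA[
```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """For each query, a softmax(q.k / sqrt(d_k))-weighted sum of the values."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)  # how much each source position matters for q
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```
]]></preformat>
        <p>Each decoder position thus looks at every encoder state, weighted by relevance, instead of relying on a single fixed sentence vector.</p>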
        <p>
          In our study we used the Marian NMT framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which is an open-source
package written in C++. It was chosen because it is a well-documented,
memory- and resource-optimized implementation (https://marian-nmt.github.io/), and it is easy to use. For
these reasons Marian NMT is the most commonly used framework among academic
users as well as commercial developers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Sentence Piece Tokenizer</title>
        <p>Since NMT systems run on GPUs, one of their limitations is the size of
the GPU memory. This factor defines the size of the vocabulary that can be used
by the NMT system. A word-based system is usually limited to about 100K distinct
words, and the remaining words are handled as unknown.</p>
        <p>
          The problem was solved by reducing the smallest translation unit from the word to
the subword (word fragment) [
          <xref ref-type="bibr" rid="ref14">15</xref>
          ]. BPE (byte pair encoding) is a data compression
procedure in which the most common byte pairs are replaced by a byte that is not
included in the data itself. The procedure first creates a character-based dictionary
from the corpus, where all words are represented as character sequences. Secondly, it
merges frequent character sequences into stand-alone tokens based on their frequency. This process
not only compresses the data but also solves the handling of unknown words, since
words which were originally not included in the corpus can also be composed
from subwords.
        </p>
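        <p>The merge procedure above can be sketched in a few lines. A toy, illustrative implementation of learning BPE merges (the "&lt;/w&gt;" end-of-word marker and the data structures are our own choices, following the original BPE description, not the paper's code):</p>
        <preformat><![CDATA[
```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations from a list of words."""
    # Start from a character-level representation; "</w>" marks word ends.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged token.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```
]]></preformat>
        <p>Applying the learned merges to new text segments any word, seen or unseen, into known subword tokens.</p>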
        <p>
          This method was further developed by Kudo and Richardson [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ]. They
created an unsupervised text tokenizer and detokenizer, called SentencePiece, primarily
for neural network-based machine learning tasks. It implements BPE as well as a
subword segmentation weighted by a unigram language model [6]. By using this system,
language-specific preprocessing steps - such as tokenization or lowercasing - are not
needed. The essence of this method is the limitation of the number of different "words" and
the elimination of unknown words in the training set. In this way the number of
parameters in the neural network can be significantly reduced.
        </p>
        <p>(1) Plain text: Petőfi Sándor egy nagyszerű költő.</p>
        <p>SPM text: ▁P ető fi ▁S ándor ▁egy ▁nagyszerű ▁költő .</p>
        <p>Example (1) shows the output of the SPM model. The words of the plain
text are broken into frequently occurring character sequences. It is interesting to
note that the spaces of the original sentence are also attached to the words and
treated as separate characters (▁).</p>
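        <p>Because of this marker convention, SPM tokenization is lossless: the original sentence is recovered by concatenating the pieces and turning the markers back into spaces. A small sketch (assuming ▁, U+2581, is the marker, as SentencePiece uses by default):</p>
        <preformat><![CDATA[
```python
META = "\u2581"  # the "▁" marker SentencePiece substitutes for spaces

def detokenize(pieces):
    """Lossless inverse of SPM segmentation: join pieces, restore spaces."""
    return "".join(pieces).replace(META, " ").strip()

# the segmentation of example (1)
pieces = ["▁P", "ető", "fi", "▁S", "ándor", "▁egy",
          "▁nagyszerű", "▁költő", "."]
```
]]></preformat>
        <p>No language-specific detokenizer is needed, which is what makes SPM convenient as the final postprocessing step of the restoration pipeline.</p>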
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Demo interface</title>
        <p>We have created a demo interface (http://nlpg.itk.ppke.hu/projects/accent, see Figure 2) to demonstrate the different models.
Using a drop-down menu the actual model can be selected, and there is an input
field where words can be typed. The demo examines the text typed before
spaces, and if it is found to be incorrect, a suggestion is made to correct it
dynamically.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>The theoretical base of our work were the newly available transformer and SPM
technologies. Our goal was to improve the quality of diacritic restoration with these
new methods.</p>
      <sec id="sec-5-1">
        <title>4.1. Corpora</title>
        <p>In our work the online available parallel corpus called OpenSubtitles
(http://opus.nlpl.eu/OpenSubtitles-v2018.php) was
used. This corpus contains texts written in 62 languages; it consists of film subtitles
and thus mainly includes shorter, informal sentences. We tested our system on 14
East-Central European languages which use the Latin alphabet. The whole list of the
selected languages, their sentence and word statistics, and diacritic symbols are
shown in Table 1. We can see that some languages are quite low-resourced, while
most of them have more than 20 million segments. It is also interesting to examine
the ratio of diacritic symbols among the words and characters.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. NMT system</title>
        <p>Marian NMT was applied to train and decode the NMT models, with setup values
based on its default parameter settings. In contrast to the related publications, the same
SPM vocabulary was used on both sides, which forces the system to tokenize source and target
words in the same way. This led to 100% character coverage on the test corpora.</p>
        <p>The following parameters were used to train our NMT-TM model:
• transformer model: size of vocabulary: 16000; number of encoder-decoder layers: 6; transformer-dropout: 0.1;
• learning rate: 0.0003; lr-warmup: 16000; lr-decay-inv-sqrt: 16000;
• optimizer hyperparameters: 0.9 0.98 1e-09; beam-size: 6; normalize: 0.6;
• label-smoothing: 0.1; exponential-smoothing</p>
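        <p>These settings correspond closely to Marian's command-line options. A hedged sketch of how such a training run might be invoked (file paths are placeholders, and the flag set follows Marian's documented options rather than the authors' actual scripts; beam-size and normalize apply at decoding/validation time):</p>
        <preformat><![CDATA[
```shell
# Illustrative Marian training invocation (paths are placeholders):
# source side = diacritic-stripped text, target side = accented text,
# with one shared SentencePiece vocabulary for both sides.
marian \
  --type transformer \
  --train-sets corpus.noacc corpus.acc \
  --vocabs spm.spm spm.spm \
  --dim-vocabs 16000 16000 \
  --enc-depth 6 --dec-depth 6 \
  --transformer-dropout 0.1 \
  --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 \
  --optimizer-params 0.9 0.98 1e-09 \
  --label-smoothing 0.1 --exponential-smoothing \
  --beam-size 6 --normalize 0.6 \
  --model model.npz
```
]]></preformat>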
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Trained models</title>
        <p>First of all, we created a transformer NMT based diacritic restoration system for each of
the 14 languages separately. To create the training data, all diacritic symbols were
removed from the words as a first step. Hereinafter this architecture will be referred to
as NMT-TM. To train the systems, we separated 5,000 segments for the validation
set and 3,000 segments for the test set from each of the training corpora. With the validation
set the neural network parameters were optimized during training. The results
calculated on the separated data will be referred to as the monolingual models.</p>
        <p>Secondly, two multilingual models were trained. We randomly selected 1
million segments from each language resource and mixed them into one training corpus.
This model is called multi1M. The advantages of this model are the limited training time
and the much lower hardware requirements; on the other hand, it has much
less training data for any specific language. Furthermore, similar languages could
degrade each other's restoration quality, although in the case of the low-resourced languages this
technique could help. To increase the quality of the multi1M system we trained a
similar system in which the language code was inserted as the first token at the
beginning of every segment; as a result, the quality drop could be eliminated. This
system is called multi1M+lang.</p>
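        <p>The language-code insertion is a simple preprocessing step. A sketch (the exact token format, here "&lt;2xx&gt;", is our illustrative choice; any reserved token the tokenizer keeps intact would serve):</p>
        <preformat><![CDATA[
```python
def add_lang_token(segments, lang_code):
    """Prepend a language-identifier token to each training segment."""
    # "<2xx>" is a hypothetical token format; it only matters that the same
    # token is used consistently at training and at inference time.
    token = f"<2{lang_code}>"
    return [f"{token} {seg}" for seg in segments]
```
]]></preformat>
        <p>At inference time the same token tells the shared model which language's accentuation rules to apply.</p>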
        <p>
          Thirdly, we compared our method with the previous state-of-the-art machine
translation based solutions for the Hungarian language [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ]. For quality
comparison, reference systems were trained: the SMT system without the morphological analyzer described
by Novák [
          <xref ref-type="bibr" rid="ref12">13</xref>
          ] (SMT) and an RNN-based neural machine translation system
(NMT-RNN) [
          <xref ref-type="bibr" rid="ref9">10</xref>
          ].
        </p>
        <p>
          To train the SMT system, the Moses framework [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] was used, and the language
model was created with KenLM [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The default settings of the
system were used during training, and the word-binding and reordering steps were left out.
These steps are not needed in the case of accent restoration, as the word count and
word order are the same on both the source and the target side. The text preprocessing
phase consists of tokenization and a "truecase" step; truecasing is the process which
decides whether the initial word of a sentence is basically used in lowercase or
uppercase. Accordingly, detruecasing and detokenization steps are performed during
postprocessing. In the case of NMT-RNN, Marian NMT was used with the s2s model
type, trained on BPE-tokenized data.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Evaluation</title>
      <p>Precision, recall, and absolute accuracy of the word-based results were measured
in our research. Since originally correct words may change during machine
translation, it is necessary to check the accuracy of the translation for all words
(ALL). Furthermore, these metrics were also calculated only for the translation of
words which could contain diacritic characters.</p>
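      <p>A word-level scorer of this kind can be sketched as follows. The precise true/false-positive definitions are not spelled out in the text, so this sketch uses one plausible convention: a word counts as accented if it differs from its diacritic-stripped form.</p>
      <preformat><![CDATA[
```python
import unicodedata

def _strip(word):
    d = unicodedata.normalize("NFD", word)
    return unicodedata.normalize(
        "NFC", "".join(c for c in d if not unicodedata.combining(c)))

def evaluate(ref_words, hyp_words):
    """Word accuracy over all words (ALL), plus P/R/F1 over accented words."""
    assert len(ref_words) == len(hyp_words)
    pairs = list(zip(ref_words, hyp_words))
    acc = sum(r == h for r, h in pairs) / len(pairs)
    tp = sum(1 for r, h in pairs if h != _strip(h) and h == r)  # correct accents
    fp = sum(1 for r, h in pairs if h != _strip(h) and h != r)  # wrong accents
    fn = sum(1 for r, h in pairs if r != _strip(r) and h != r)  # missed accents
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, precision, recall, f1
```
]]></preformat>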
      <p>In Table 1 the quality of the restoration is shown for all monolingual
models, for which 14 different translation systems were trained on monolingual data.
We can see that all of our systems reached an accuracy higher than 94%, and half of
them are above 99%. Unfortunately, we were not able to check the reference data
of every system manually, since we do not have language specialists for most of the
languages. [Flattened table residue removed here: F1 and accuracy columns of the
monolingual models and of the multi1M model; the language labels of the rows
were lost in extraction.] Despite the huge amount of Romanian data, it reached
the lowest quality. During the human evaluation of the training data we found
that more than 20% of the data lacked diacritic symbols, which could explain
this result.</p>
      <p>The other two columns list the quality of the multilingual systems.
For comparability, the shown results were calculated on the same test
sets as the monolingual ones; none of the test sentences were part of the training or
the validation data. We can see that inserting the language code at the beginning
of the segments increases the quality of the multi1M system significantly, and all
of the results are comparable with the monolingual ones. It is really interesting to see
that the multi1M+lang system outperforms the less effective monolingual
models, which means that it could use some information from the other languages. It is
important to note that the multi1M system is trained on a corpus which includes
only 1 million segments from each language. In the last two rows the multi1M systems
were evaluated on multilingual test sets, which were separated from their own
training data.</p>
      <p>Finally, we compared our system with the previous state-of-the-art machine
translation based solutions tested on Hungarian. During our first
measurement we found several misspelled words in our reference
sets, so we corrected the Hungarian one manually. The results
with this modified test set are shown in Table 3. From this table we can see that
our system significantly outperforms the previous MT based systems from every
point of view. We found that, in contrast to the SMT and NMT-RNN systems, NMT-TM did
not replace correct words (without accents) with incorrectly spelled ones.
[Flattened error-type table residue: equivalent alternative forms (hova - hová
(where), tied - tiéd (yours)); context-dependent forms (ref: Érdekelné ez a dolog?
res: Érdekelne ez a dolog?; ref: Különben nem hoznák haza. res: Különben nem
hoznak haza.); foreign proper names (Liúról - Liuról, Ramával - Rámával);
ambiguous accented pairs (még - meg, melyen - mélyen, teli - téli).]</p>
      <p>During the deeper analysis we found that, of the 18,438 tokens (8,957 types),
only 69 words differ from the reference. When the results were manually evaluated,
we found that more than half of the errors (38 items) had an equivalent meaning or
a correct alternative form (e.g. hova - hová (where); tied - tiéd (yours)). The
remaining 31 words were indeed incorrectly restored: 14 were foreign proper names and
17 had an ambiguous meaning with and without the accent. Most of these cases would
need further context for disambiguation, as in example (2).</p>
      <p>(2) REF: Különben nem hoznák haza. (Otherwise, they may not bring her home.)</p>
      <p>RES: Különben nem hoznak haza. (Otherwise, they will not bring me home.)</p>
      <p>The other advantage of the multilingual models is the significant reduction in
training time. On average, one monolingual model took 36 hours to train, which
for the 14 languages adds up to about 504 hours. In comparison, the multilingual
model was trained in 33 hours.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>In this research a diacritic restoration system was created based on
state-of-the-art neural machine translation techniques. First of all, the system was trained
on 14 different East-Central European languages. In most cases our system achieves
an accuracy over 99%. Secondly, two multilingual models were created, which reduce the
hardware requirements and training time of the system while maintaining comparable
performance. Finally, our method was compared with the existing state-of-the-art
Hungarian accent restoration systems. Our system reaches 99.83% relative
accuracy, which significantly outperforms them. We created a demo site where the
system can be tried. In the future, we would like to extend our multilingual models
with the remaining Latin-based languages, such as German, French, etc.</p>
      <p>Acknowledgements. This research was implemented with support provided by
the Artificial Intelligence National Excellence Program (grant no.:
2018-1.2.1-NKP-2018-00008).</p>
      <p>[6] Kudo, T. Subword regularization: Improving neural network translation models
with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers) (Melbourne,
Australia, July 2018), Association for Computational Linguistics, pp. 66–75.</p>
      <p>[19] Ács, J., and Halmi, J. Hunaccent: Small footprint diacritic restoration for social
media. In Proceedings of the Tenth International Conference on Language Resources
and Evaluation (LREC 2016) (Portoroz, Slovenia, 2016), pp. 3526–3529.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>In 3rd International Conference on Learning Representations, ICLR 2015</source>
          , San Diego, CA, USA, May 7-9,
          <year>2015</year>
          , Conference Track Proceedings,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and Y. LeCun, Eds.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Barrault</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Bojar</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Costa-jussà</surname>, <given-names>M. R.</given-names></string-name>,
          <string-name><surname>Federmann</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Fishel</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Graham</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Haddow</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Huck</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Koehn</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Malmasi</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Monz</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Müller</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Pal</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Post</surname>, <given-names>M.</given-names></string-name>, and
          <string-name><surname>Zampieri</surname>, <given-names>M.</given-names></string-name>
          <article-title>Findings of the 2019 conference on machine translation (WMT19)</article-title>
          .
          <source>In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)</source>
          (Florence, Italy,
          <year>August 2019</year>
          ), Association for Computational Linguistics, pp.
          <fpage>1</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Heafield</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>KenLM: faster and smaller language model queries</article-title>
          .
          <source>In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation</source>
          (Edinburgh, Scotland, United Kingdom,
          <year>July 2011</year>
          ), pp.
          <fpage>187</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Junczys-Dowmunt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grundkiewicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwojak</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heafield</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neckermann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seide</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Germann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aji</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogoychev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martins</surname>
            ,
            <given-names>A. F. T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Marian: Fast neural machine translation in C++</article-title>
          .
          <source>In Proceedings of ACL 2018, System Demonstrations</source>
          (Melbourne, Australia,
          <year>July 2018</year>
          ), Association for Computational Linguistics, pp.
          <fpage>116</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Federico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertoldi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moran</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Herbst</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Moses: Open source toolkit for statistical machine translation</article-title>
          .
          <source>In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions</source>
          (Stroudsburg, PA, USA,
          <year>2007</year>
          ), ACL '07, Association for Computational Linguistics, pp.
          <fpage>177</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kudo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          (Brussels, Belgium, Nov.
          <year>2018</year>
          ),
          Association for Computational Linguistics
          , pp.
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Laki</surname>
            ,
            <given-names>L. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z. G.</given-names>
          </string-name>
          <article-title>Automatikus ékezetvisszaállítás transzformer modellen alapuló neurális gépi fordítással</article-title>
          .
          <source>XVI. Magyar Számítógépes Nyelvészeti Konferencia</source>
          (
          <year>2020</year>
          ),
          <fpage>181</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Letter level learning for language independent diacritics restoration</article-title>
          .
          <source>In Proceedings of the 6th Conference on Natural Language Learning - Volume 20</source>
          (Stroudsburg, PA, USA,
          <year>2002</year>
          ), COLING-02, Association for Computational Linguistics, pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nagy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Magyar nyelvű zajos szövegek automatikus normalizálása</article-title>
          .
          <source>Master's thesis</source>
          ,
          <source>Pázmány Péter Katolikus Egyetem</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Náplava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straňák</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Diacritics restoration using neural networks</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          (Miyazaki, Japan, May
          <year>2018</year>
          ), European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Németh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zainkó</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fekete</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olaszy</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Endrédi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olaszi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiss</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>The design, implementation, and operation of a Hungarian e-mail reader</article-title>
          .
          <source>International Journal of Speech Technology</source>
          <volume>3</volume>
          ,
          <issue>3</issue>
          (Dec
          <year>2000</year>
          ),
          <fpage>217</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Novák</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Siklósi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Automatic diacritics restoration for Hungarian</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          (Lisbon, Portugal, Sept.
          <year>2015</year>
          ),
          Association for Computational Linguistics
          , pp.
          <fpage>2286</fpage>
          -
          <lpage>2291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Scannell</surname>
            ,
            <given-names>K. P.</given-names>
          </string-name>
          <article-title>Statistical unicodification of African languages</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          ,
          <issue>3</issue>
          (Jun
          <year>2011</year>
          ),
          <fpage>375</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Neural machine translation of rare words with subword units</article-title>
          .
          <source>CoRR abs/1508.07909</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>Ł.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 30</source>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          , Eds. Curran Associates, Inc.,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Yarowsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>A Comparison of Corpus-Based Techniques for Restoring Accents in Spanish and French Text</article-title>
          . Springer Netherlands, Dordrecht,
          <year>1999</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Grabar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>Accenting unknown words in a specialized language</article-title>
          .
          <source>In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain</source>
          (Philadelphia, Pennsylvania, USA,
          <year>July 2002</year>
          ),
          Association for Computational Linguistics
          , pp.
          <fpage>21</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>