A neural machine translation system for Galician from transliterated Portuguese text
Un sistema de tradución neuronal para el gallego a partir de texto portugués transliterado

John E. Ortega (john.ortega@usc.gal), Iria De-Dios-Flores, José Ramom Pichel, Pablo Gamallo (pablo.gamallo@usc.gal)

Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidad de Santiago de Compostela, Spain

ISSN 1613-0073

Keywords: Galician Language, Neural Machine Translation, Transliteration

We present a neural machine translation (NMT) system for translating both Spanish and English into Galician (ES-GL and EN-GL). Galician, spoken in northwestern Spain, is a low- to medium-resource language closely related to Portuguese. Our NMT system is trained on large-scale synthetic ES → PT → GL and EN → PT → GL parallel corpora, created by transliterating the Portuguese side of high-quality Spanish-Portuguese (ES-PT) and English-Portuguese (EN-PT) translation memories into Galician spelling. The NMT system is available via a public web interface at https://demos.citius.usc.es/nos_tradutor.

Introduction

Several systems have been developed and compared for machine translation (MT), ranging from rule-based systems to systems based on neural networks [1]. Traditionally, rule-based systems like Apertium [2] are used for languages with a small amount of parallel data. This is because MT systems backed by neural networks, or neural machine translation (NMT) systems, require large amounts of data, typically on the order of millions of sentences or more [3,4]. An interesting option for low-resource languages is zero-shot translation, that is, translating in a multilingual setting between language pairs on which the NMT system has never been trained. However, as Gu et al. [5] state, training zero-shot NMT models easily fails because the task is very sensitive to hyper-parameter settings, and the performance of zero-shot strategies is usually lower than that of more conventional pivot-based approaches.

We describe and implement an approach inspired by previous work [6] that exploits the proximity of Portuguese and Galician to overcome the lack of resources. It produces corpora for building an NMT system, similar to low-resource NMT systems in previous work [7,8], that translates both Spanish and English into Galician. First, our system takes high-quality Spanish-Portuguese (ES-PT) and English-Portuguese (EN-PT) parallel corpora and converts the target-side (Portuguese) sentences (or segments) to Galician using transliteration, the conversion of text from one language to another through spelling. Transliteration between Portuguese and Galician works well because of the orthographic nearness of the two languages reported in previous work [9]. Second, the transliterated Galician parallel text is used to train Spanish-Galician (ES-GL) and English-Galician (EN-GL) NMT systems, with Spanish and English as the source languages and Galician as the target language. Two neural architectures were tested: long short-term memory (LSTM) and Transformer.

Method

Our translation strategy consists of two steps. The first step uses transliteration [10] to create parallel Galician segments from the Portuguese segments in the aligned corpus, by means of the transliteration tool port2gal¹, which contains several hundred rules on characters and sequences of characters. Both the training and validation sets are transliterated, leaving a final parallel Galician corpus.

Table 1: Results obtained for the two language pairs (ES-GL and EN-GL) evaluated on two different systems, LSTM and Transformer, using three quantitative measures: BLEU, TER and ChrF2. The corpus size is given in millions of sentences (M).

Then, in the second step, the transliterated Galician corpus is used to train an NMT system with Spanish or English as the source language and Galician as the target language. For the first (transliteration) step, we also tested a more complex strategy that combines the PT→GL Apertium translator [2], which uses a basic bilingual dictionary to translate word by word, with the transliteration tool for those words that are not in the bilingual dictionary.
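The first step, including the dictionary-plus-transliteration variant, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the four spelling rules and the three dictionary entries are hand-picked examples, not the port2gal rule set or the Apertium lexicon, and the names SPELLING_RULES, PT_GL_DICTIONARY, transliterate and pt_to_gl are hypothetical.

```python
import re

# Illustrative spelling rules only (assumption): NOT the actual port2gal rules.
# Each rule rewrites a Portuguese character sequence into a Galician one.
SPELLING_RULES = [
    (re.compile(r"ções\b"), "cións"),  # e.g. nações -> nacións
    (re.compile(r"ção\b"), "ción"),    # e.g. informação -> información
    (re.compile(r"nh"), "ñ"),          # e.g. caminho -> camiño
    (re.compile(r"lh"), "ll"),         # e.g. trabalho -> traballo
]

# Toy word-level dictionary (assumption) standing in for a bilingual lexicon.
PT_GL_DICTIONARY = {"não": "non", "muito": "moito", "e": "e"}


def transliterate(word: str) -> str:
    """Apply every spelling rule in order to a lowercase Portuguese word."""
    for pattern, replacement in SPELLING_RULES:
        word = pattern.sub(replacement, word)
    return word


def pt_to_gl(sentence: str, dictionary=None) -> str:
    """Convert a Portuguese sentence word by word.

    Words found in the dictionary are replaced by their entry; all other
    words fall back to rule-based transliteration. With no dictionary, the
    function performs transliteration only.
    """
    dictionary = dictionary or {}
    converted = []
    for token in sentence.split():
        if token in dictionary:
            converted.append(dictionary[token])
        else:
            converted.append(transliterate(token))
    return " ".join(converted)


# Only the target (Portuguese) side of an aligned pair is rewritten.
pair = ("no trabajo mucho", "não trabalho muito")
synthetic_pair = (pair[0], pt_to_gl(pair[1], PT_GL_DICTIONARY))
```

Here `synthetic_pair` keeps the Spanish source untouched and pairs it with the converted target, which is the shape of the synthetic training data described above. A real rule set would also need to handle capitalisation, punctuation attached to tokens, and rule ordering conflicts, which this sketch ignores.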

The NMT system that we use for ES-GL and EN-GL translation was created using OpenNMT [11], a generic deep learning framework for building sequence-to-sequence models in machine translation. In particular, we trained an LSTM (long short-term memory) seq2seq model as well as a Transformer model for each language pair.

For the LSTM, we used the following default neural network training parameters: two hidden layers, 500 hidden LSTM units per layer, input feeding enabled, 13 epochs, and a batch size of 64. We also modified the default learning step parameters to 100,000 training steps and 10,000 validation steps. Traditional tokenization was performed with Linguakit [12]. The Transformer implementation, described in Garg et al. [13], was configured with default training parameters: 6 layers for both encoding and decoding and a batch size of 4096 tokens. We modified the learning step parameters to the same values as in the LSTM configuration. In this case, we used sub-word tokenization, performed with SentencePiece [14].

Corpora

The main parallel sources we used to train the NMT system come from Opus². In particular, we used the ES-PT and EN-PT partitions of both Europarl³, with about 2 million sentences per language pair, and OpenSubtitles⁴, containing about 30 million sentences in ES-PT and 25 million in EN-PT. The Portuguese partition was transliterated to Galician so as to build ES-GL and EN-GL parallel corpora. In addition, we added the Spanish-Galician partition of CLUVI⁵, containing 144 thousand sentences, to the ES-GL corpus.

Test results

Table 1 shows the results of different experiments for ES-GL and EN-GL, combining the system (LSTM or Transformer) with the size of the corpus. We observe that LSTM works very well for close languages (ES-GL), but for the pair EN-GL, two distant languages, the results are slightly better with Transformer. In addition, we observe that using the whole OpenSubtitles corpus hurts performance in ES-GL. The best results in ES-GL combine Europarl and CLUVI with part of OpenSubtitles (5M sentences in total) and are comparable to the state of the art [15]. Note that the movie and TV subtitles of OpenSubtitles are a highly valuable resource, but the quality of the resulting sentence alignments is often lower than for other parallel corpora [16]. The results in Table 1 allow us to confirm that favorable outcomes can be achieved by using transliteration between two closely related languages like Portuguese and Galician.
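For readers unfamiliar with the ChrF2 column of Table 1, the idea behind a character n-gram F-score with β = 2 can be sketched as follows. This is a simplified Python illustration, not the evaluation script used for the reported scores: the function names are hypothetical, whitespace is simply dropped, and published implementations differ in such details, so this sketch is not expected to reproduce the numbers in the table.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of a string, ignoring spaces (simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale.

    Precision and recall are computed for each n-gram order from 1 to max_n
    using clipped counts, averaged over the orders, and combined into an
    F-beta score. With beta = 2, recall weighs more than precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_counts = char_ngrams(hypothesis, n)
        ref_counts = char_ngrams(reference, n)
        if not hyp_counts or not ref_counts:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp_counts & ref_counts).values())
        precisions.append(overlap / sum(hyp_counts.values()))
        recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# A single missing letter lowers the score without dropping it to zero.
score = chrf("acidente", "accidente")
```

Because the score is computed over characters rather than words, a near-miss spelling still earns partial credit, which makes this family of metrics a natural companion to BLEU and TER when the output differs from the reference mostly in orthography.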

Demonstration

Our demonstration consists of a public-facing web page⁶ that provides Galician translations for both Spanish and English inputs. Users can test the system via an open web interface (see Figure 1), where they select the language pair (ES-GL or EN-GL) and the translation system (LSTM or Transformer), then enter text and generate translations.

In our demonstration, we plan to show where our system performs well and where it does not. As an example, the sentence translated from Spanish to Galician using the LSTM system in Table 2 is an excellent translation despite its length. Additionally, our system handles syntax well and generally seems to translate better than previous systems tested on the same domain. Nonetheless, when examining lexical and morphological quality, we have found that the Portuguese transliteration affects performance: on these aspects, rule-based MT systems such as Apertium [2] perform better.

Future work

We plan to perform further work with a human in the loop to improve translation quality. This is outlined in a continuous improvement plan that involves translators in user functionality tests. For example, spelling and lexical issues such as acidente instead of accidente, which reflect differences from formal Galician, will be addressed first using newly developed heuristics. The aim is to create the highest-quality system possible in order to expand the language pairs to other languages such as Russian or Chinese.

Table 2: Translation using the best performing machine translation system (LSTM).

Spanish: Debemos imponer el cumplimiento de los reglamentos y velar por que se aplique el principio de que "el que contamina paga" para que se utilicen sanciones y también incentivos financieros a fin de presionar a los propietarios de los buques y las compañías petroleras y lograr que se introduzcan los procedimientos mejores.

Galician: Temos de impor o cumpremento dos regulamentos e celar por que o principio do poluidor-pagador sexa aplicado para que sexan utilizadas sancións e tamén incentivos financeiros a fin de exercer presión sobre os proprietarios dos navíos e das compañías petrolíferas e conseguir que os procedementos mellores sexan introducidos.

Figure 1: A screen capture of the web interface.

Table 1

system       pair   source corpus                  size   BLEU  TER   ChrF2
lstm         es-gl  Europarl+CLUVI                 2.35M  48.9  34.4  69.3
lstm         es-gl  Europarl+CLUVI+OpenSubt(part)  5M     51.1  32.8  70.8
lstm         es-gl  Europarl+CLUVI+OpenSubt        30M    46.0  37.2  66.5
transformer  es-gl  Europarl+CLUVI                 2.35M  17.5  67.4  53.0
transformer  es-gl  Europarl+CLUVI+OpenSubt        30M    13.9  66.7  46.4
lstm         en-gl  Europarl+OpenSubt              27.M   26.6  50.3  45.5
transformer  en-gl  Europarl+OpenSubt              27.M   29.3  49.7  51.0

¹ https://github.com/gamallo/port2gal
² https://opus.nlpl.eu
³ https://opus.nlpl.eu/Europarl.php
⁴ https://opus.nlpl.eu/OpenSubtitles.php
⁵ https://repositori.upf.edu/handle/10230/20051
⁶ https://demos.citius.usc.es/nos_tradutor

Acknowledgments

This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between Xunta de Galicia and University of Santiago de Compostela, and by grant ED431G2019/04 from the Galician Ministry of Education, University and Professional Training and the European Regional Development Fund (ERDF/FEDER program).

References

[1] R. Knowles, J. Ortega, P. Koehn, A comparison of machine translation paradigms for use in black-box fuzzy-match repair, in: Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing, 2018.

[2] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, F. M. Tyers, Apertium: a free/open-source platform for rule-based machine translation, Machine Translation 25 (2011).

[3] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).

[4] P. Koehn, R. Knowles, Six challenges for neural machine translation, arXiv preprint arXiv:1706.03872 (2017).

[5] J. Gu, Y. Wang, K. Cho, V. O. Li, Improved zero-shot neural machine translation via ignoring spurious correlations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019. doi:10.18653/v1/P19-1121.

[6] J. R. P. Campos, P. M. Fernández, O. Gomez, P. Gamallo, A. C. García, Carvalho: English-Galician SMT system from Europarl English-Portuguese parallel corpus, Procesamiento del Lenguaje Natural (2009).

[7] J. E. Ortega, R. C. Mamani, K. Cho, Neural machine translation with a polysynthetic low resource language, Machine Translation 34 (2020).

[8] J. E. Ortega, R. A. Castro-Mamani, J. R. Montoya Samame, Overcoming resistance: The normalization of an Amazonian tribal language, in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Association for Computational Linguistics, Suzhou, China, 2020.

[9] J. R. Pichel, P. Gamallo, I. Alegria, M. Neves, A methodology to measure the diachronic language distance between three languages based on perplexity, Journal of Quantitative Linguistics 28 (2021).

[10] K. Knight, J. Graehl, Machine transliteration, arXiv preprint cmp-lg/9704003 (1997).

[11] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, Vancouver, Canada, 2017.

[12] P. Gamallo, M. Garcia, C. Piñeiro, R. Martinez-Castaño, J. C. Pichel, LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction, in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2018. doi:10.1109/SNAMS.2018.8554689.

[13] S. Garg, S. Peitz, U. Nallasamy, M. Paulik, Jointly learning to align and translate with transformer models, CoRR abs/1909.02074 (2019).

[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).

[15] M. D. C. Bayón, P. Sánchez-Gijón, Evaluating machine translation in a low-resource language combination: Spanish-Galician, in: Machine Translation Summit XVII, Volume 2: Translator, Project and User Tracks, 2019.

[16] P. Lison, J. Tiedemann, M. Kouylekov, OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018.