A neural machine translation system for Galician from transliterated Portuguese text

John E. Ortega, Iria de-Dios-Flores, José Ramom Pichel and Pablo Gamallo
Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidad de Santiago de Compostela, Spain

Abstract
We present a neural machine translation (NMT) system for translating both Spanish and English to Galician (ES–GL and EN–GL). Galician, a language closely related to Portuguese, is a low-to-medium-resource language spoken in northwestern Spain. Our NMT system is trained on large-scale synthetic ES→PT→GL and EN→PT→GL parallel corpora created by the spelling transliteration of Portuguese into Galician from high-quality Spanish–Portuguese (ES–PT) and English–Portuguese (EN–PT) translation memories. The NMT system is made available via a public web interface at https://demos.citius.usc.es/nos_tradutor.

Keywords
Galician Language, Neural Machine Translation, Transliteration

1. Introduction

Several systems have been compared and developed to perform machine translation (MT), ranging from rule-based systems to systems based on neural networks [1]. Traditionally, rule-based systems like Apertium [2] are used for languages with a small amount of parallel data. That is because MT systems backed by neural networks, or neural machine translation (NMT) systems, require large amounts of data, typically on the order of millions of sentences or more [3, 4]. An interesting option for low-resource languages is the use of zero-shot translation techniques, that is, translating in multilingual settings between language pairs for which the NMT system has never been trained. However, as Gu et al. [5] note, training zero-shot NMT models easily fails because the task is very sensitive to hyper-parameter settings. The performance of zero-shot strategies is also usually lower than that of more conventional pivot-based approaches.
We describe and implement an approach, inspired by previous work [6], that uses the proximity of Portuguese and Galician to overcome the lack of resources and produces corpora to build an NMT system, similar to the low-resource NMT systems found in previous work [7, 8], for translating both Spanish to Galician and English to Galician. Our system first uses high-quality Spanish–Portuguese (ES–PT) and English–Portuguese (EN–PT) parallel corpora, converting the target-side (Portuguese) sentences (or segments) to Galician using transliteration, the conversion of text in one language to another through spelling. Transliteration between Portuguese and Galician works well due to the orthographic nearness of the two languages, as found in previous work [9]. Second, NMT systems are trained on the transliterated Galician parallel text to form Spanish–Galician (ES–GL) and English–Galician (EN–GL) MT systems, where Spanish and English are the source languages and Galician is the target language. Two different neural-based architectures were tested: long short-term memory (LSTM) and Transformer.

SEPLN-PD 2022. Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations, September 21-23, 2022, A Coruña, Spain.

Contact: john.ortega@usc.gal (J. E. Ortega, ORCID 0000-0002-2328-3205); iria.dedios@usc.gal (I. de-Dios-Flores, ORCID 0000-0002-5941-1707); jramon.pichel@usc.gal (J. R. Pichel, ORCID 0000-0001-5172-6803); pablo.gamallo@usc.gal (P. Gamallo, ORCID 0000-0002-5819-2469).

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (http://ceur-ws.org, ISSN 1613-0073).

2. Method

Our translation strategy consists of two steps. The first step uses transliteration [10] to create parallel Galician segments from the Portuguese segments in the aligned corpus, making use of the transliteration tool port2gal (https://github.com/gamallo/port2gal), which contains several hundred rules on characters and sequences of characters.
Both training and validation sets are transliterated, leaving a final parallel Galician corpus. Then, in the second step, the (transliterated) Galician corpus is used to train an NMT system with Spanish or English as the source language and Galician as the target language.

system       pair   source corpus                  size   BLEU  TER   chrF2
LSTM         ES-GL  Europarl+CLUVI                 2.35M  48.9  34.4  69.3
LSTM         ES-GL  Europarl+CLUVI+OpenSubt(part)  5M     51.1  32.8  70.8
LSTM         ES-GL  Europarl+CLUVI+OpenSubt        30M    46.0  37.2  66.5
Transformer  ES-GL  Europarl+CLUVI                 2.35M  17.5  67.4  53.0
Transformer  ES-GL  Europarl+CLUVI+OpenSubt        30M    13.9  66.7  46.4
LSTM         EN-GL  Europarl+OpenSubt              27M    26.6  50.3  45.5
Transformer  EN-GL  Europarl+OpenSubt              27M    29.3  49.7  51.0

Table 1: Results obtained for the two language pairs (ES–GL and EN–GL) with two different systems, LSTM and Transformer, using three quantitative measures: BLEU, TER and chrF2. Corpus size is given in millions of sentences (M).
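To make the first, rule-based step concrete, here is a minimal transliteration sketch in the spirit of port2gal. The five rules below are an illustrative hand-picked subset chosen for this example, not port2gal's actual rule set, which contains several hundred rules over characters and character sequences:

```python
import re

# A few illustrative Portuguese -> Galician spelling rules, applied in order.
# This is a toy subset; the real port2gal tool uses several hundred
# character- and sequence-level rules.
PT2GL_RULES = [
    (r"ções\b", "cións"),   # traduções -> traducións
    (r"ção\b",  "ción"),    # tradução  -> tradución
    (r"nh",     "ñ"),       # vinho     -> viño
    (r"lh",     "ll"),      # trabalho  -> traballo
    (r"ss",     "s"),       # passo     -> paso
]

def transliterate(text: str) -> str:
    """Apply the spelling rules left to right, one after another."""
    for pattern, replacement in PT2GL_RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

Because the rules fire in order, the more specific plural pattern (-ções) must precede the general one (-ção); the real tool additionally handles many more context-dependent cases.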
For the first, transliteration step, we also tested a more complex strategy: combining the PT→GL Apertium translator [2], which uses a basic bilingual dictionary to translate word by word, with the transliteration tool for those words that are not in the bilingual dictionary.

The NMT systems that we use for the ES–GL and EN–GL translations were created using OpenNMT [11], a generic deep learning framework for building sequence-to-sequence models for machine translation. In particular, we trained an LSTM (long short-term memory) seq2seq model as well as a Transformer model for each language pair.

For the LSTM, we used the following default neural-network training parameters: two hidden layers, 500 hidden LSTM units per layer, input feeding enabled, 13 epochs, and a batch size of 64. In addition, we changed the default learning-step parameters to 100,000 training steps and 10,000 validation steps. Traditional tokenization was performed with Linguakit [12].

The Transformer implementation, described in Garg et al. [13], was configured with default training parameters: 6 layers for both the encoder and the decoder, and a batch size of 4096 tokens. We set the learning-step parameters to the same values as in the LSTM configuration. In this case, sub-word tokenization was performed with SentencePiece [14].
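For reference, the hyper-parameters above can be gathered in one place. The dictionaries below restate them in the shape of an OpenNMT-py training configuration; the option names mirror common OpenNMT-py settings but are an illustrative sketch (not a verified configuration file) and should be checked against the toolkit version used:

```python
# Training hyper-parameters from the text, collected as Python dicts.
# Key names are assumptions modeled on OpenNMT-py option names.

LSTM_CONFIG = {
    "encoder_type": "rnn",
    "decoder_type": "rnn",
    "layers": 2,             # two hidden layers
    "rnn_size": 500,         # 500 hidden LSTM units per layer
    "input_feed": 1,         # input feeding enabled
    "batch_size": 64,        # sentences per batch
    "train_steps": 100_000,
    "valid_steps": 10_000,
}

TRANSFORMER_CONFIG = {
    "encoder_type": "transformer",
    "decoder_type": "transformer",
    "layers": 6,             # 6 layers for both encoder and decoder
    "batch_size": 4096,      # measured in tokens, not sentences
    "batch_type": "tokens",
    "train_steps": 100_000,  # same learning-step values as the LSTM set-up
    "valid_steps": 10_000,
}
```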
3. Corpora

The main parallel sources used to train the NMT systems come from OPUS (https://opus.nlpl.eu). In particular, we used the ES–PT and EN–PT partitions of both Europarl (https://opus.nlpl.eu/Europarl.php), with about 2 million sentences per language pair, and OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles.php), containing about 30 million sentences for ES–PT and 25 million for EN–PT. The Portuguese partition was transliterated into Galician so as to build the ES–GL and EN–GL parallel corpora. In addition, we added the Spanish–Galician partition of CLUVI (https://repositori.upf.edu/handle/10230/20051), containing 144 thousand sentences, to the ES–GL corpus.

4. Test results

Table 1 shows the results of the different experiments for ES–GL and EN–GL, combining the system (LSTM or Transformer) with the size of the corpus. We observe that LSTM works very well for close languages (ES–GL), while for EN–GL, two more distant languages, the results are slightly better with the Transformer. We also observe that using the whole OpenSubtitles corpus hurts performance for ES–GL: the best ES–GL results combine Europarl and CLUVI with only part of OpenSubtitles, and they are comparable to the state of the art [15]. Note that the movie and TV subtitles of OpenSubtitles are a highly valuable resource, but the quality of the resulting sentence alignments is often lower than for other parallel corpora [16]. The results in Table 1 allow us to confirm that, by using transliteration between two closely related languages like Portuguese and Galician, favorable outcomes can be achieved.
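Among the three metrics reported in Table 1, chrF2 is perhaps the least widely known: it is a character n-gram F-score with recall weighted twice as heavily as precision (beta = 2). As a rough single-sentence illustration of the metric (a sketch, not the exact implementation behind the scores above, which comes from standard evaluation tooling):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of character n-grams of order n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Character n-gram F-score; beta=2 weights recall twice as much as
    precision, giving the chrF2 variant reported in Table 1."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Scores are in [0, 1]; the chrF2 values in Table 1 are scaled by 100.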
5. Demonstration

Our demonstration is made up of a public-facing web page (https://demos.citius.usc.es/nos_tradutor) that provides Galician translations for both Spanish and English inputs. Users are able to test the system via an open web interface (see Figure 1), where they can select the language pair (ES–GL or EN–GL) and the translation system (LSTM or Transformer), then enter text and generate translations.

Figure 1: A screen capture of the web interface.

In our demonstration, we plan to show where our system performs well and where it does not. As an example, the sentence translated from Spanish to Galician using the LSTM system, shown in Table 2, is an excellent translation despite its length. Additionally, our system handles syntax well and seems to generally translate better than previous systems tested on the same domain. Nonetheless, when comparing our system's performance in terms of lexical and morphological quality, we have found that the Portuguese transliteration affects performance: rule-based MT systems such as Apertium [2] were found to perform better in this respect.

6. Future work

We plan to perform further work with a human in the loop to increase performance based on quality. This is outlined in a continuous improvement plan that includes translators in user functionality tests. For example, spelling and lexical issues, such as acidente instead of accidente, are formal Galician differences that need to be addressed; these are to be solved first using newly developed heuristics as part of our contingency plan. The aim is to create the highest-quality system possible and then to expand the language pairs to other languages such as Russian or Chinese.
Acknowledgments

This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between the Xunta de Galicia and the University of Santiago de Compostela, and by grant ED431G2019/04 of the Galician Ministry of Education, University and Professional Training and the European Regional Development Fund (ERDF/FEDER program).

Spanish: Debemos imponer el cumplimiento de los reglamentos y velar por que se aplique el principio de que "el que contamina paga" para que se utilicen sanciones y también incentivos financieros a fin de presionar a los propietarios de los buques y las compañías petroleras y lograr que se introduzcan los procedimientos mejores.

Galician: Temos de impor o cumpremento dos regulamentos e celar por que o principio do poluidor-pagador sexa aplicado para que sexan utilizadas sancións e tamén incentivos financeiros a fin de exercer presión sobre os proprietarios dos navíos e das compañías petrolíferas e conseguir que os procedementos mellores sexan introducidos.

Table 2: Translation using the best-performing machine translation system (LSTM).
References

[1] R. Knowles, J. Ortega, P. Koehn. A comparison of machine translation paradigms for use in black-box fuzzy-match repair. In: Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing, 2018, pp. 249-255.
[2] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, F. M. Tyers. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25 (2011) 127-144.
[3] D. Bahdanau, K. Cho, Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[4] P. Koehn, R. Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872 (2017).
[5] J. Gu, Y. Wang, K. Cho, V. O. Li. Improved zero-shot neural machine translation via ignoring spurious correlations. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 1258-1268. https://aclanthology.org/P19-1121. doi:10.18653/v1/P19-1121.
[6] J. R. P. Campos, P. M. Fernández, O. Gomez, P. Gamallo, A. C. García. Carvalho: English-Galician SMT system from Europarl English-Portuguese parallel corpus. Procesamiento del Lenguaje Natural (2009) 379-381.
[7] J. E. Ortega, R. C. Mamani, K. Cho. Neural machine translation with a polysynthetic low resource language. Machine Translation 34 (2020) 325-346.
[8] J. E. Ortega, R. A. Castro-Mamani, J. R. Montoya Samame. Overcoming resistance: The normalization of an Amazonian tribal language. In: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Suzhou, China, 2020, pp. 1-13. https://aclanthology.org/2020.loresmt-1.1.
[9] J. R. Pichel, P. Gamallo, I. Alegria, M. Neves. A methodology to measure the diachronic language distance between three languages based on perplexity. Journal of Quantitative Linguistics 28 (2021) 306-336.
[10] K. Knight, J. Graehl. Machine transliteration. arXiv preprint cmp-lg/9704003 (1997).
[11] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush. OpenNMT: Open-source toolkit for neural machine translation. In: Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, 2017, pp. 67-72. https://www.aclweb.org/anthology/P17-4012.
[12] P. Gamallo, M. Garcia, C. Piñeiro, R. Martinez-Castaño, J. C. Pichel. LinguaKit: A big data-based multilingual tool for linguistic analysis and information extraction. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2018, pp. 239-244. doi:10.1109/SNAMS.2018.8554689.
[13] S. Garg, S. Peitz, U. Nallasamy, M. Paulik. Jointly learning to align and translate with transformer models. CoRR abs/1909.02074 (2019). http://arxiv.org/abs/1909.02074.
[14] T. Kudo, J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226 (2018).
[15] M. D. C. Bayón, P. Sánchez-Gijón. Evaluating machine translation in a low-resource language combination: Spanish-Galician. In: Machine Translation Summit XVII, Vol. 2: Translator, Project and User Tracks, 2019, pp. 30-35.
[16] P. Lison, J. Tiedemann, M. Kouylekov. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. https://aclanthology.org/L18-1275.