1. Introduction

A Coruña, Spain. Email: john.ortega@usc.gal (J. E. Ortega); iria.dedios@usc.gal (I. de-Dios-Flores); jramon.pichel@usc.gal (J. R. Pichel); pablo.gamallo@usc.gal (P. Gamallo)

A neural machine translation system for Galician from transliterated Portuguese text

John E. Ortega

Iria de-Dios-Flores

José Ramom Pichel

Pablo Gamallo

Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidad de Santiago de Compostela, Spain

2022


We present a neural machine translation (NMT) system for translating both Spanish and English to Galician (ES–GL and EN–GL). Galician is a language closely related to Portuguese, with low to medium resources, spoken in northwestern Spain. Our NMT system is trained on large-scale synthetic ES→PT→GL and EN→PT→GL parallel corpora, created by the spelling transliteration of Portuguese to Galician from high-quality Spanish to Portuguese (ES–PT) and English to Portuguese (EN–PT) translation memories. The NMT system is made available via a public web interface at https://demos.citius.usc.es/nos_tradutor.

Keywords: Galician language, neural machine translation, transliteration

2. Method

Our translation strategy consists of two steps. The first step uses transliteration [10] to create parallel Galician segments from the Portuguese segments in the aligned corpus, by making use of the transliteration tool port2gal (https://github.com/gamallo/port2gal), which contains several hundred rules on characters and sequences of characters. Both training and validation sets are transliterated, leaving a final parallel Galician corpus. Then, in the second step, the Galician (transliterated) corpus is used to train an NMT system with Spanish or English as the source language and Galician as the target language. For the first transliteration step, we also tested a more complex strategy by combining the PT→GL Apertium translator [2], which uses a basic bilingual dictionary to translate word by word, with the transliteration tool for those words that are not in the bilingual dictionary.

The NMT system that we use for ES–GL and EN–GL translations was created using OpenNMT [11], a generic deep learning framework for creating sequence-to-sequence models in machine translation. In particular, we trained an LSTM (long short-term memory) seq2seq model as well as a Transformer model for each language pair. Concerning LSTM, we used the following default neural network training parameters: two hidden layers, 500 hidden LSTM units per layer, input feeding enabled, 13 epochs, batch size of 64. Alternatively, we modified the default learning step parameters to 100,000 training steps and 10,000 validation steps. Traditional tokenization was performed with Linguakit [12].

The Transformer implementation, described in Garg et al. [13], was configured with default training parameters: 6 layers for both encoding and decoding and batch size of 4096 tokens. We also modified the learning step parameters to the same values as the LSTM configuration. In this case, we used sub-word tokenization, performed with SentencePiece [14].

3. Corpora

The main parallel sources we used to train the NMT system come from Opus (https://opus.nlpl.eu). In particular, we used the ES–PT and EN–PT partitions of both Europarl (https://opus.nlpl.eu/Europarl.php), with about 2 million sentences per language, and OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles.php), containing about 30 million sentences in ES–PT and 25 million in EN–PT. The Portuguese partition was transliterated to Galician so as to build ES–GL and EN–GL parallel corpora. In addition, we also added the Spanish-Galician partition of CLUVI (https://repositori.upf.edu/handle/10230/20051), containing 144 thousand sentences, to the ES–GL corpus.

4. Test results

Table 1 shows the results of different experiments for ES–GL and EN–GL, along with the system (LSTM or Transformer), the combination of corpora, and the corpus size. We observe that LSTM works very well for close languages (ES–GL), but for the pair (EN–GL), two distant languages, the results are slightly better with Transformer. In addition, we also observe that the whole OpenSubtitles corpus hurts the performance in ES–GL. The best results in EN–GL combine Europarl with OpenSubtitles and are comparable to the state-of-the-art [15]. Let us note that the Movie and TV subtitles of OpenSubtitles are a highly valuable resource, but the quality of the resulting sentence alignments is often lower than for other parallel corpora [16]. The results in Table 1 allow us to confirm that favorable outcomes can be achieved by using transliteration between two closely aligned languages like Portuguese and Galician.

Table 1: Experiments by system, language pair, training source and corpus size (the BLEU scores are missing from this copy).

System       Pair   Source                         Corpus size  BLEU
lstm         es-gl  Europarl+CLUVI                 2.35M
lstm         es-gl  Europarl+CLUVI+OpenSubt(part)  5M
lstm         es-gl  Europarl+CLUVI+OpenSubt        30M
transformer  es-gl  Europarl+CLUVI                 2.35M
transformer  es-gl  Europarl+CLUVI+OpenSubt        30M
lstm         en-gl  Europarl+OpenSubt              27.M
transformer  en-gl  Europarl+OpenSubt              27.M

5. Demonstration

Our demonstration is made up of a public-facing web page (https://demos.citius.usc.es/nos_tradutor) that provides Galician translations for both Spanish and English inputs. Users will be able to test the system via an open web interface (see Figure 1), where they can select the language pair (ES–GL or EN–GL) and translation system (LSTM or Transformer), then enter text and generate translations.

In our demonstration, we plan to show where our system performs well and where it does not. As an example, the sentence translated from Spanish to Galician using the LSTM system in Table 2 is an excellent translation despite its length. Additionally, our system's translations handle syntax well and seem to be generally better than those of previous systems tested on the same domain. Nonetheless, when comparing lexical and morphological quality, we have found that the Portuguese transliteration affects performance, which is better in rule-based MT systems such as Apertium [2].
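To make the lexical issue concrete, the following minimal Python sketch imitates the combined strategy from Section 2: a bilingual dictionary lookup first, with rule-based spelling transliteration as the fallback for words that are not in the dictionary. The rules, the dictionary entries and the function names are illustrative assumptions; they are not taken from port2gal or Apertium.

```python
import re

# Hypothetical spelling rules on character sequences, longest pattern first.
# They only illustrate the idea and are NOT the actual port2gal rule set.
RULES = [
    ("ções", "cións"),
    ("ção", "ción"),
    ("nh", "ñ"),
    ("lh", "ll"),
]

# Hypothetical toy dictionary standing in for a PT->GL bilingual lexicon.
LEXICON = {"dois": "dous", "ontem": "onte"}


def transliterate_word(word):
    """Apply every spelling rule, in order, to a single word."""
    for source, target in RULES:
        word = word.replace(source, target)
    return word


def pt_to_gl(sentence, lexicon=None):
    """Convert a Portuguese sentence word by word.

    Words found in the lexicon are translated; all other words fall back
    to rule-based transliteration. Punctuation and spacing are kept.
    """
    lexicon = lexicon or {}

    def convert(match):
        word = match.group(0)
        entry = lexicon.get(word.lower())
        if entry is None:
            return transliterate_word(word)
        # Restore an initial capital if the source word had one.
        return entry[0].upper() + entry[1:] if word[0].isupper() else entry

    return re.sub(r"\w+", convert, sentence)
```

With these toy rules, `pt_to_gl("a nação trabalha")` gives "a nación traballa". A word covered by neither a rule nor a dictionary entry passes through unchanged, which is how Portuguese forms can leak into the synthetic Galician side of the corpus.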

6. Future work

We plan to carry out further work with a human in the loop to improve translation quality. This is outlined in a continuous improvement plan that includes translators in user functionality tests. For example, spelling and lexical issues, such as acidente instead of accidente, and other differences with respect to formal Galician will be addressed first, using newly developed heuristics as part of our future contingency plan. The aim is to create the highest-quality system possible in order to expand to other language pairs involving languages such as Russian or Chinese.
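As an illustration of what such a heuristic could look like, the sketch below applies a small exception list to system output and records every change so that a translator can review it. The function name, the format of the exception list and the review log are assumptions made for this example; only the pair acidente/accidente comes from the text above.

```python
import re

# Hypothetical exception list: form produced by the system -> standard
# Galician form. The single pair below is the example given in the text;
# a real list would be curated by translators.
EXCEPTIONS = {"acidente": "accidente"}


def post_correct(text, exceptions=EXCEPTIONS):
    """Replace known non-standard forms and log each change for review.

    Returns the corrected text and a list of (original, replacement) pairs.
    """
    changes = []

    def fix(match):
        word = match.group(0)
        target = exceptions.get(word.lower())
        if target is None:
            return word
        # Keep the capitalisation of the first letter.
        if word[0].isupper():
            target = target[0].upper() + target[1:]
        changes.append((word, target))
        return target

    return re.sub(r"\w+", fix, text), changes
```

Returning the list of changes alongside the text keeps the correction step auditable, which fits a human-in-the-loop workflow where translators confirm or reject each substitution.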

Acknowledgments

This research was funded by the project “Nós: Galician in the society and economy of artificial intelligence”, agreement between Xunta de Galicia and University of Santiago de Compostela, and grant ED431G2019/04 by the Galician Ministry of Education, University and Professional Training, and the European Regional Development Fund (ERDF/FEDER program).

Table 2: Example of a Spanish sentence translated into Galician by the LSTM system.

Spanish: Debemos imponer el cumplimiento de los reglamentos y velar por que se aplique el principio de que “el que contamina paga” para que se utilicen sanciones y también incentivos financieros a fin de presionar a los propietarios de los buques y las compañías petroleras y lograr que se introduzcan los procedimientos mejores.

Galician: Temos de impor o cumpremento dos regulamentos e celar por que o principio do poluidor-pagador sexa aplicado para que sexan utilizadas sancións e tamén incentivos financeiros a fin de exercer presión sobre os proprietarios dos navíos e das compañías petrolíferas e conseguir que os procedementos mellores sexan introducidos.

References

[1] R. Knowles, J. E. Ortega, P. Koehn, A comparison of machine translation paradigms for use in black-box fuzzy-match repair, in: Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing, 2018, pp. 249-255.

[2] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, F. M. Tyers, Apertium: a free/open-source platform for rule-based machine translation, Machine Translation 25 (2011) 127-144.

[3] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).

[4] P. Koehn, R. Knowles, Six challenges for neural machine translation, arXiv preprint arXiv:1706.03872 (2017).

[5] J. Gu, Y. Wang, K. Cho, V. O. Li, Improved zero-shot neural machine translation via ignoring spurious correlations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1258-1268. URL: https://aclanthology.org/P19-1121. doi:10.18653/v1/P19-1121.

[6] J. R. P. Campos, P. M. Fernández, Gomez, P. Gamallo, A. C. García, Carvalho: English-Galician SMT system from Europarl English-Portuguese parallel corpus, Procesamiento del Lenguaje Natural (2009) 379-381.

[7] J. E. Ortega, R. C. Mamani, K. Cho, Neural machine translation with a polysynthetic low resource language, Machine Translation 34 (2020) 325-346.

[8] J. E. Ortega, R. A. Castro-Mamani, J. R. Montoya Samame, Overcoming resistance: The normalization of an Amazonian tribal language, in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Association for Computational Linguistics, Suzhou, China, 2020, pp. 1-13. URL: https://aclanthology.org/2020.loresmt-1.1.

[9] J. R. Pichel, P. Gamallo, I. Alegria, M. Neves, A methodology to measure the diachronic language distance between three languages based on perplexity, Journal of Quantitative Linguistics 28 (2021) 306-336.

[10] K. Knight, J. Graehl, Machine transliteration, arXiv preprint cmp-lg/9704003 (1997).

[11] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 67-72. URL: https://www.aclweb.org/anthology/P17-4012.

[12] P. Gamallo, M. Garcia, C. Piñeiro, R. Martinez-Castaño, J. C. Pichel, LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction, in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2018, pp. 239-244. doi:10.1109/SNAMS.2018.8554689.

[13] S. Garg, S. Peitz, U. Nallasamy, M. Paulik, Jointly learning to align and translate with transformer models, CoRR abs/1909.02074 (2019). URL: http://arxiv.org/abs/1909.02074. arXiv:1909.02074.

[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).

[15] M. D. C. Bayón, P. Sánchez-Gijón, Evaluating machine translation in a low-resource language combination: Spanish-Galician, in: Machine Translation Summit XVII Vol. 2: Translator, Project and User Tracks, 2019, pp. 30-35.

[16] P. Lison, J. Tiedemann, M. Kouylekov, OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1275.