A neural machine translation system for Galician from
transliterated Portuguese text

John E. Ortega, Iria de-Dios-Flores, José Ramom Pichel and Pablo Gamallo
Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidad de Santiago de Compostela,
Spain

Abstract
We present a neural machine translation (NMT) system for translating both Spanish and English to Galician (ES–GL and EN–GL). Galician is a language closely related to Portuguese, with low to medium resources, spoken in northwestern Spain. Our NMT system is trained on large-scale synthetic ES→PT→GL and EN→PT→GL parallel corpora created by spelling transliteration of Portuguese to Galician applied to high-quality Spanish–Portuguese (ES–PT) and English–Portuguese (EN–PT) translation memories. The NMT system is then made available via a public web interface at https://demos.citius.usc.es/nos_tradutor.

Keywords
Galician Language, Neural Machine Translation, Transliteration



SEPLN-PD 2022. Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations, September 21-23, 2022, A Coruña, Spain
Contact: john.ortega@usc.gal (J. E. Ortega); iria.dedios@usc.gal (I. de-Dios-Flores); jramon.pichel@usc.gal (J. R. Pichel); pablo.gamallo@usc.gal (P. Gamallo)
ORCID: 0000-0002-2328-3205 (J. E. Ortega); 0000-0002-5941-1707 (I. de-Dios-Flores); 0000-0001-5172-6803 (J. R. Pichel); 0000-0002-5819-2469 (P. Gamallo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


1. Introduction

Several systems have been compared and developed to perform machine translation (MT), ranging from rule-based systems to systems based on neural networks [1]. Traditionally, rule-based systems like Apertium [2] are used for languages with a small amount of parallel data. That is because MT systems backed by neural networks, or neural machine translation (NMT) systems, require large amounts of data, typically on the order of millions of sentences or more [3, 4]. An interesting option for low-resource languages is the use of zero-shot translation techniques, that is, translating in multilingual settings between language pairs for which the NMT system has never been trained. However, as Gu et al. [5] state, training zero-shot NMT models easily fails, as this task is very sensitive to hyper-parameter settings. The performance of zero-shot strategies is usually lower than that of more conventional pivot-based approaches.
   We describe and implement an approach inspired by previous work [6] that uses the proximity of Portuguese and Galician to overcome the lack-of-resources problem and produces corpora to build an NMT system, similar to low-resource NMT systems found in previous work [7, 8], for translating both Spanish to Galician and English to Galician. Our system first uses high-quality Spanish–Portuguese (ES–PT) and English–Portuguese (EN–PT) parallel corpora and translates the target-side (Portuguese) sentences (or segments) to Galician using transliteration, the conversion of text in one language to another through spelling. Transliteration between Portuguese and Galician works well due to the orthographic nearness of the two languages, established in previous work [9]. Second, NMT systems are trained on the transliterated Galician parallel text to form Spanish–Galician (ES–GL) and English–Galician (EN–GL) MT systems, where both Spanish and English are the source languages and Galician is the target language. Two different neural-based architectures were tested: long short-term memory (LSTM) and Transformer.

2. Method

Our translation strategy consists of two steps. The first step uses transliteration [10] to create parallel Galician segments from the Portuguese segments in the aligned corpus, by making use of the transliteration tool port2gal (https://github.com/gamallo/port2gal), which contains several hundred rules on characters and sequences of characters. Both training and validation sets are transliterated, leaving a final parallel Galician corpus.
 System         Pair                 Source                   Corpus size    BLEU    TER    ChrF2
 lstm           es-gl           Europarl+CLUVI                   2.35M       48.9    34.4    69.3
 lstm           es-gl   Europarl+CLUVI+OpenSubt(part)              5M        51.1    32.8    70.8
 lstm           es-gl     Europarl+CLUVI+OpenSubt                 30M        46.0    37.2    66.5
 transformer    es-gl           Europarl+CLUVI                   2.35M       17.5    67.4    53.0
 transformer    es-gl     Europarl+CLUVI+OpenSubt                 30M        13.9    66.7    46.4
 lstm           en-gl         Europarl+OpenSubt                   27M        26.6    50.3    45.5
 transformer    en-gl         Europarl+OpenSubt                   27M        29.3    49.7    51.0
Table 1
Results obtained for the two language pairs (ES–GL and EN–GL) evaluated on two different systems, LSTM and Transformer, by making use of three quantitative measures: BLEU, TER and ChrF2. The corpus size is quantified in millions of sentences (M).
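As context for the ChrF2 column in Table 1: ChrF is a character n-gram F-score in which recall is weighted more heavily than precision (beta = 2). The paper does not say which scoring tool produced these figures, so the following is a simplified, self-contained sketch of the metric for illustration only, not the actual evaluation code:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Character n-grams with spaces removed, as in common chrF variants.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    # Average n-gram precision and recall over n = 1..max_n, then combine
    # them with an F-beta score (beta = 2 weights recall over precision).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        if hyp and ref:
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if (p + r) > 0 else 0.0
```

A perfect hypothesis scores 1.0 and a completely disjoint one scores 0.0; production scorers additionally average over a corpus and report the value on a 0–100 scale, as in Table 1.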



Then, in the second step, the Galician (transliterated) corpus is used to train an NMT system with Spanish or English as the source language and Galician as the target language. For the first transliteration step, we also tested a more complex strategy by combining the PT→GL Apertium translator [2], which uses a basic bilingual dictionary to translate word by word, with the transliteration tool for those words that are not in the bilingual dictionary.
   The NMT system that we use for ES–GL and EN–GL translations was created using OpenNMT [11], a generic deep learning framework for creating sequence-to-sequence models in machine translation. In particular, we trained an LSTM (long short-term memory) seq2seq model as well as a Transformer model for each language pair.
   Concerning the LSTM, we used the following default neural network training parameters: two hidden layers, 500 hidden LSTM units per layer, input feeding enabled, 13 epochs, and a batch size of 64. Additionally, we modified the default learning step parameters to 100,000 training steps and 10,000 validation steps. Traditional tokenization was performed with Linguakit [12].
   The Transformer implementation, described in Garg et al. [13], was configured with default training parameters: 6 layers for both encoding and decoding and a batch size of 4096 tokens. We also modified the learning step parameters to the same values as the LSTM configuration. In this case, we used sub-word tokenization, performed with SentencePiece [14].

3. Corpora

The main parallel sources we used to train the NMT system come from Opus (https://opus.nlpl.eu). In particular, we used the ES–PT and EN–PT partitions of both Europarl (https://opus.nlpl.eu/Europarl.php), with about 2 million sentences per language pair, and OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles.php), containing about 30 million sentences in ES–PT and 25 million in EN–PT. The Portuguese partition was transliterated to Galician so as to build ES–GL and EN–GL parallel corpora. In addition, we also added the Spanish–Galician partition of CLUVI (https://repositori.upf.edu/handle/10230/20051), containing 144 thousand sentences, to the ES–GL corpus.

4. Test results

Table 1 shows the results of different experiments for ES–GL and EN–GL, combining the system, LSTM or Transformer, with the size of the corpus. We observe that the LSTM works very well for close languages (ES–GL), but for EN–GL, a pair of distant languages, the results are slightly better with the Transformer. In addition, we also observe that the whole OpenSubtitles corpus hurts performance in ES–GL. The best results in ES–GL combine Europarl with a part of OpenSubtitles and are comparable to the state of the art [15]. Let us note that the movie and TV subtitles of OpenSubtitles are a highly valuable resource, but the quality of the resulting sentence alignments is often lower than for other parallel corpora [16]. The results in Table 1 allow us to confirm that, by using transliteration between two closely related languages like Portuguese and Galician, favorable outcomes can be achieved.

5. Demonstration

Our demonstration is made up of a public-facing web page (https://demos.citius.usc.es/nos_tradutor) that provides Galician translations for both Spanish and English inputs. Users are able to test the system via an open web interface (see Figure 1), where they can select the language pair (ES–GL or EN–GL) and translation system




(LSTM or Transformer) to then enter text and generate translations.

Figure 1: A screen capture of the web interface.

   In our demonstration, we plan to show where our system performs well and where it does not. As an example, the sentence translated from Spanish to Galician using the LSTM system in Table 2 is an excellent translation despite its long length. Additionally, our system's translations handle syntax well and seem to generally translate better than previous systems tested on the same domain. Nonetheless, we have found that the Portuguese transliteration affects lexical and morphological quality, which was found to be better in rule-based MT systems like Apertium [2], for example.

6. Future work

We plan to perform further work with a human in the loop to increase performance based on quality. This is outlined in a continuous improvement plan that includes translators in user functionality tests. For example, spelling and lexical issues, such as acidente instead of accidente, are differences with formal Galician that need to be addressed and are first to be solved using newly developed heuristics. The aim will be to create the highest-quality system possible and then to expand the language pairs to other languages such as Russian or Chinese.

Acknowledgments

This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between the Xunta de Galicia and the University of Santiago de Compostela, and by grant ED431G2019/04 from the Galician Ministry of Education, University and Professional Training and the European Regional Development Fund (ERDF/FEDER program).

References

 [1] R. Knowles, J. Ortega, P. Koehn, A comparison of machine translation paradigms for use in black-box fuzzy-match repair, in: Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing, 2018, pp. 249–255.
 [2] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, F. M. Tyers, Apertium: a free/open-source platform for rule-based machine translation, Machine Translation 25 (2011) 127–144.
 [3] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
 [4] P. Koehn, R. Knowles, Six challenges for neural machine translation, arXiv preprint arXiv:1706.03872 (2017).
 [5] J. Gu, Y. Wang, K. Cho, V. O. Li, Improved zero-shot neural machine translation via ignoring spurious correlations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1258–1268. URL: https://aclanthology.org/P19-1121. doi:10.18653/v1/P19-1121.
 [6] J. R. P. Campos, P. M. Fernández, O. Gomez, P. Gamallo, A. C. García, Carvalho: English-Galician SMT system from Europarl English-Portuguese parallel corpus, Procesamiento del Lenguaje Natural (2009) 379–381.
 [7] J. E. Ortega, R. C. Mamani, K. Cho, Neural machine translation with a polysynthetic low resource language, Machine Translation 34 (2020) 325–346.
 [8] J. E. Ortega, R. A. Castro-Mamani, J. R. Montoya Samame, Overcoming resistance: The normalization of an Amazonian tribal language, in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Association for Computational Linguistics, Suzhou, China, 2020, pp. 1–13. URL: https://aclanthology.org/2020.loresmt-1.1.




 Spanish                                                        Galician
 Debemos imponer el cumplimiento de los reglamentos             Temos de impor o cumpremento dos regulamentos e celar
 y velar por que se aplique el principio de que “el que         por que o principio do poluidor-pagador sexa aplicado para
 contamina paga” para que se utilicen sanciones y también       que sexan utilizadas sancións e tamén incentivos finan-
 incentivos financieros a fin de presionar a los propietarios   ceiros a fin de exercer presión sobre os proprietarios dos
 de los buques y las compañías petroleras y lograr que se       navíos e das compañías petrolíferas e conseguir que os
 introduzcan los procedimientos mejores.                        procedementos mellores sexan introducidos.
Table 2
Translation using the best performing machine translation system (LSTM).
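The two-step strategy of Section 2 — word-by-word lookup in a bilingual dictionary, with spelling transliteration as the fallback for words that the dictionary does not cover — can be sketched as follows. The rules and dictionary entries below are a tiny illustrative subset, not the actual contents of port2gal or of the Apertium dictionary:

```python
# Illustrative sketch of the two-step PT->GL conversion: bilingual-dictionary
# lookup first, spelling transliteration as the fallback. The rules and the
# dictionary are toy examples, not the real port2gal or Apertium resources.
RULES = [
    ("ção", "ción"),  # tradução -> tradución
    ("nh", "ñ"),      # caminho -> camiño
    ("lh", "ll"),     # filho -> fillo
    ("ç", "z"),       # cabeça -> cabeza
]

DICTIONARY = {"obrigado": "grazas"}  # words handled by the bilingual dictionary

def transliterate(word: str) -> str:
    # Apply longer rules first so that "ção" wins over the bare "ç" rule.
    for src, tgt in sorted(RULES, key=lambda r: -len(r[0])):
        word = word.replace(src, tgt)
    return word

def pt_to_gl(sentence: str) -> str:
    # Dictionary lookup where possible, transliteration otherwise.
    return " ".join(DICTIONARY.get(w, transliterate(w)) for w in sentence.split())
```

Because Portuguese and Galician orthographies are so close, even the bare rule cascade recovers valid Galician for many words, which is what makes the synthetic training corpora of Section 3 feasible.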




 [9] J. R. Pichel, P. Gamallo, I. Alegria, M. Neves, A methodology to measure the diachronic language distance between three languages based on perplexity, Journal of Quantitative Linguistics 28 (2021) 306–336.
[10] K. Knight, J. Graehl, Machine transliteration, arXiv preprint cmp-lg/9704003 (1997).
[11] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 67–72. URL: https://www.aclweb.org/anthology/P17-4012.
[12] P. Gamallo, M. Garcia, C. Piñeiro, R. Martinez-Castaño, J. C. Pichel, LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction, in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2018, pp. 239–244. doi:10.1109/SNAMS.2018.8554689.
[13] S. Garg, S. Peitz, U. Nallasamy, M. Paulik, Jointly learning to align and translate with transformer models, CoRR abs/1909.02074 (2019). URL: http://arxiv.org/abs/1909.02074. arXiv:1909.02074.
[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[15] M. D. C. Bayón, P. Sánchez-Gijón, Evaluating machine translation in a low-resource language combination: Spanish-Galician, in: Machine Translation Summit XVII Vol. 2: Translator, Project and User Tracks, 2019, pp. 30–35.
[16] P. Lison, J. Tiedemann, M. Kouylekov, OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1275.



