=Paper=
{{Paper
|id=Vol-3224/paper22
|storemode=property
|title=A neural machine translation system for Galician from transliterated Portuguese text
|pdfUrl=https://ceur-ws.org/Vol-3224/paper22.pdf
|volume=Vol-3224
|authors=John E. Ortega,Iria de-Dios-Flores,José Ramom Pichel Campos,Pablo Gamallo
|dblpUrl=https://dblp.org/rec/conf/sepln/OrtegadC022
}}
==A neural machine translation system for Galician from transliterated Portuguese text==
Un sistema de tradución neuronal para el gallego a partir de texto portugués transliterado
John E. Ortega, Iria de-Dios-Flores, José Ramom Pichel and Pablo Gamallo
Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidad de Santiago de Compostela,
Spain
Abstract
We present a neural machine translation (NMT) system for translating both Spanish and English to Galician
(ES–GL and EN–GL). Galician is a language closely related to Portuguese, with low to medium resources,
spoken in northwestern Spain. Our NMT system is trained on large-scale synthetic ES→PT→GL and
EN→PT→GL parallel corpora created by the spelling transliteration of Portuguese to Galician from
high-quality Spanish–Portuguese (ES–PT) and English–Portuguese (EN–PT) translation memories. The
NMT system is made available via a public web interface at https://demos.citius.usc.es/nos_tradutor.
Keywords
Galician Language, Neural Machine Translation, Transliteration
1. Introduction

Several systems have been compared and developed to perform machine translation (MT), ranging from rule-based systems to systems based on neural networks [1]. Traditionally, rule-based systems like Apertium [2] are used for languages with a small amount of parallel data. That is because MT systems backed by neural networks, or neural machine translation (NMT) systems, require high amounts of data, typically on the order of millions of sentences or more [3, 4]. An interesting option for low-resource languages is the use of zero-shot translation techniques, that is, translating in multilingual settings between language pairs for which the NMT system has never been trained. However, as Gu et al. [5] state, training zero-shot NMT models easily fails, as this task is very sensitive to hyper-parameter settings. The performance of zero-shot strategies is usually lower than that of more conventional pivot-based approaches.

We describe and implement an approach inspired by previous work [6] that uses the proximity of Portuguese and Galician to overcome the lack-of-resources problem and produces corpora to build an NMT system, similar to the low-resource NMT systems found in previous work [7, 8], for translating both Spanish to Galician and English to Galician. Our system first uses high-quality Spanish–Portuguese (ES–PT) and English–Portuguese (EN–PT) parallel corpora and translates the target-side (Portuguese) sentences (or segments) to Galician using transliteration, the conversion of text in one language to another through spelling. Transliteration between Portuguese and Galician works well due to the orthographic nearness of the two languages, as found in previous work [9]. Second, NMT systems are trained on the transliterated Galician parallel text to form Spanish–Galician (ES–GL) and English–Galician (EN–GL) MT systems where Spanish and English are the source languages and Galician is the target language. Two different neural-based architectures were tested: long short-term memory (LSTM) and Transformer.
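The transliteration idea that this approach builds on can be sketched as an ordered list of spelling rules applied to the Portuguese side of a parallel corpus. The rules below are a handful of illustrative PT–GL orthographic correspondences chosen for this sketch, not the actual port2gal rule set, which contains several hundred such rules:

```python
import re

# A few illustrative PT -> GL spelling correspondences. Order matters:
# longer or more specific patterns must be applied first.
RULES = [
    (r"ções\b", "cións"),  # traduções  -> traducións
    (r"ção\b", "ción"),    # tradução   -> tradución
    (r"ssão\b", "sión"),   # missão     -> misión
    (r"nh", "ñ"),          # caminho    -> camiño
    (r"lh", "ll"),         # filho      -> fillo
    (r"ss", "s"),          # passo      -> paso
]

def transliterate_pt_to_gl(text: str) -> str:
    """Apply the spelling rules in order to a Portuguese string."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

# A parallel ES-PT corpus becomes a synthetic ES-GL corpus by
# transliterating the Portuguese side of every segment pair.
es_pt = [("el camino", "o caminho"), ("la traducción", "a tradução")]
es_gl = [(es, transliterate_pt_to_gl(pt)) for es, pt in es_pt]
```

Because the rules operate purely on spelling, the synthetic Galician side inherits the sentence alignment of the original ES–PT corpus for free, which is what makes the pivot cheap compared to full translation.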
SEPLN-PD 2022: Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations, September 21-23, 2022, A Coruña, Spain.
john.ortega@usc.gal (J. E. Ortega); iria.dedios@usc.gal (I. de-Dios-Flores); jramon.pichel@usc.gal (J. R. Pichel); pablo.gamallo@usc.gal (P. Gamallo)
ORCID: 0000-0002-2328-3205 (J. E. Ortega); 0000-0002-5941-1707 (I. de-Dios-Flores); 0000-0001-5172-6803 (J. R. Pichel); 0000-0002-5819-2469 (P. Gamallo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

2. Method

Our translation strategy consists of two steps. The first step uses transliteration [10] to create parallel Galician segments from the Portuguese segments in the aligned corpus, making use of the transliteration tool port2gal (https://github.com/gamallo/port2gal), which contains several hundred rules on characters and sequences of characters. Both the training and validation sets are transliterated, leaving a final parallel Galician corpus. Then, in the second step, the Galician (transliterated) corpus is used to train an NMT system with Spanish or English as the source language and Galician as the target language. For the first transliteration step, we also tested a more complex strategy by combining the PT→GL Apertium translator [2], which uses a basic bilingual dictionary to translate word by word, with the transliteration tool for those words that are not in the bilingual dictionary.

The NMT system that we use for ES–GL and EN–GL translations was created using OpenNMT [11], a generic deep learning framework for creating sequence-to-sequence models in machine translation. In particular, we trained an LSTM (long short-term memory) seq2seq model as well as a Transformer model for each language pair.

Concerning the LSTM, we used the following default neural network training parameters: two hidden layers, 500 hidden LSTM units per layer, input feeding enabled, 13 epochs, and a batch size of 64. Additionally, we modified the default learning step parameters to 100,000 training steps and 10,000 validation steps. Traditional tokenization was performed with Linguakit [12].

The Transformer implementation, described in Garg et al. [13], was configured with default training parameters: 6 layers for both encoding and decoding and a batch size of 4,096 tokens. We also modified the learning step parameters to the same values as in the LSTM configuration. In this case, we used sub-word tokenization, performed with SentencePiece [14].

3. Corpora

The main parallel sources we used to train the NMT system come from Opus (https://opus.nlpl.eu). In particular, we used the ES–PT and EN–PT partitions of both Europarl (https://opus.nlpl.eu/Europarl.php), with about 2 million sentences per language pair, and OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles.php), containing about 30 million sentences in ES–PT and 25 million in EN–PT. The Portuguese partition was transliterated to Galician so as to build ES–GL and EN–GL parallel corpora. In addition, we also added the Spanish–Galician partition of CLUVI (https://repositori.upf.edu/handle/10230/20051), containing 144 thousand sentences, to the ES–GL corpus.

system       pair   source corpus                   size   BLEU  TER   chrF2
LSTM         ES–GL  Europarl+CLUVI                  2.35M  48.9  34.4  69.3
LSTM         ES–GL  Europarl+CLUVI+OpenSubt (part)  5M     51.1  32.8  70.8
LSTM         ES–GL  Europarl+CLUVI+OpenSubt         30M    46.0  37.2  66.5
Transformer  ES–GL  Europarl+CLUVI                  2.35M  17.5  67.4  53.0
Transformer  ES–GL  Europarl+CLUVI+OpenSubt         30M    13.9  66.7  46.4
LSTM         EN–GL  Europarl+OpenSubt               27M    26.6  50.3  45.5
Transformer  EN–GL  Europarl+OpenSubt               27M    29.3  49.7  51.0

Table 1
Results obtained for the two language pairs (ES–GL and EN–GL) evaluated on two different systems, LSTM and Transformer, using three quantitative measures: BLEU, TER and chrF2. The corpus size is quantified in millions of sentences (M).

4. Test results

Table 1 shows the results of different experiments for ES–GL and EN–GL, combining the system, LSTM or Transformer, with the size of the corpus. We observe that the LSTM works very well for closely related languages (ES–GL), but for the EN–GL pair, two distant languages, the results are slightly better with the Transformer. In addition, we observe that using the whole OpenSubtitles corpus hurts the performance in ES–GL. The best results in ES–GL combine Europarl and CLUVI with part of OpenSubtitles and are comparable to the state of the art [15]. Note that the movie and TV subtitles of OpenSubtitles are a highly valuable resource, but the quality of the resulting sentence alignments is often lower than for other parallel corpora [16]. The results in Table 1 allow us to confirm that, by using transliteration between two closely related languages like Portuguese and Galician, favorable outcomes can be achieved.

5. Demonstration

Our demonstration is made up of a public-facing web page (https://demos.citius.usc.es/nos_tradutor) that provides Galician translations for both Spanish and English inputs. Users are able to test the system via an open web interface (see Figure 1), where they can select the language pair (ES–GL or EN–GL) and the translation system
(LSTM or Transformer) and then enter text to generate translations.

Figure 1: A screen capture of the web interface.

In our demonstration, we plan to show where our system performs well and where it does not. As an example, the sentence translated from Spanish to Galician using the LSTM system in Table 2 is an excellent translation despite its long length. Additionally, our system's translations handle syntax well and seem to generally translate better than previous systems tested on the same domain. Nonetheless, we have found that, when comparing our system's performance for lexical and morphological quality, the Portuguese transliteration affects the performance; rule-based MT systems like Apertium [2], for example, were found to perform better in this respect.

6. Future work

We plan to perform further work with a human in the loop to increase the performance based on quality. This is outlined in a continuous improvement plan that includes translators in user functionality tests. For example, spelling and lexical issues such as acidente instead of accidente, formal Galician differences that need to be addressed, are first to be solved using newly developed heuristics as part of our future contingency plan. The aim will be to create the highest-quality system in order to expand the language pairs to other languages such as Russian or Chinese.

Acknowledgments

This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", agreement between Xunta de Galicia and University of Santiago de Compostela, and grant ED431G2019/04 by the Galician Ministry of Education, University and Professional Training, and the European Regional Development Fund (ERDF/FEDER program).

References

[1] R. Knowles, J. Ortega, P. Koehn, A comparison of machine translation paradigms for use in black-box fuzzy-match repair, in: Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing, 2018, pp. 249–255.
[2] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, F. M. Tyers, Apertium: a free/open-source platform for rule-based machine translation, Machine Translation 25 (2011) 127–144.
[3] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[4] P. Koehn, R. Knowles, Six challenges for neural machine translation, arXiv preprint arXiv:1706.03872 (2017).
[5] J. Gu, Y. Wang, K. Cho, V. O. Li, Improved zero-shot neural machine translation via ignoring spurious correlations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1258–1268. URL: https://aclanthology.org/P19-1121. doi:10.18653/v1/P19-1121.
[6] J. R. P. Campos, P. M. Fernández, O. Gomez, P. Gamallo, A. C. García, Carvalho: English-Galician SMT system from Europarl English-Portuguese parallel corpus, Procesamiento del Lenguaje Natural (2009) 379–381.
[7] J. E. Ortega, R. C. Mamani, K. Cho, Neural machine translation with a polysynthetic low resource language, Machine Translation 34 (2020) 325–346.
[8] J. E. Ortega, R. A. Castro-Mamani, J. R. Montoya Samame, Overcoming resistance: The normalization of an Amazonian tribal language, in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Association for Computational Linguistics, Suzhou, China, 2020, pp. 1–13. URL: https://aclanthology.org/2020.loresmt-1.1.
[9] J. R. Pichel, P. Gamallo, I. Alegria, M. Neves, A methodology to measure the diachronic language distance between three languages based on perplexity, Journal of Quantitative Linguistics 28 (2021) 306–336.
[10] K. Knight, J. Graehl, Machine transliteration, arXiv preprint cmp-lg/9704003 (1997).
[11] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. Rush, OpenNMT: Open-source toolkit for neural machine translation, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 67–72. URL: https://www.aclweb.org/anthology/P17-4012.
[12] P. Gamallo, M. Garcia, C. Piñeiro, R. Martinez-Castaño, J. C. Pichel, LinguaKit: A big data-based multilingual tool for linguistic analysis and information extraction, in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2018, pp. 239–244. doi:10.1109/SNAMS.2018.8554689.
[13] S. Garg, S. Peitz, U. Nallasamy, M. Paulik, Jointly learning to align and translate with transformer models, CoRR abs/1909.02074 (2019). URL: http://arxiv.org/abs/1909.02074. arXiv:1909.02074.
[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[15] M. D. C. Bayón, P. Sánchez-Gijón, Evaluating machine translation in a low-resource language combination: Spanish-Galician, in: Machine Translation Summit XVII Vol. 2: Translator, Project and User Tracks, 2019, pp. 30–35.
[16] P. Lison, J. Tiedemann, M. Kouylekov, OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1275.

Table 2
Translation using the best performing machine translation system (LSTM).

Spanish: Debemos imponer el cumplimiento de los reglamentos y velar por que se aplique el principio de que "el que contamina paga" para que se utilicen sanciones y también incentivos financieros a fin de presionar a los propietarios de los buques y las compañías petroleras y lograr que se introduzcan los procedimientos mejores.

Galician: Temos de impor o cumpremento dos regulamentos e celar por que o principio do poluidor-pagador sexa aplicado para que sexan utilizadas sancións e tamén incentivos financeiros a fin de exercer presión sobre os proprietarios dos navíos e das compañías petrolíferas e conseguir que os procedementos mellores sexan introducidos.