<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TAN-IBE: Neural Machine Translation for the Romance Languages of the Iberian Peninsula</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antoni Oliver</string-name>
          <email>aoliverg@uoc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mercè Vàzquez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Coll-Florit</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergi Álvarez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Víctor Suárez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudi Aventín-Boya</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Valdés</string-name>
          <email>cris@uniovi.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mar Font</string-name>
          <email>mar.font@udl.cat</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Pardos</string-name>
          <email>apardoscalvo@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Oviedo. Campus de Humanidades "El Milán", C/ Amparo Pedregal</institution>
          ,
          <addr-line>s/n, 33011 Oviedo</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Zaragoza.</institution>
          <addr-line>Pedro Cerbuna 12 50009 Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Oberta de Catalunya (UOC). Rambla del Poblenou</institution>
          ,
          <addr-line>156 08018 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universitat de Lleida. Plaça de Víctor Siurana</institution>
          ,
          <addr-line>1, 25003 Lleida</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the project TAN-IBE: Neural Machine Translation for the Romance Languages of the Iberian Peninsula, a three-year research project. Its main objective is to conduct research on techniques for training NMT systems for these languages, as there are high-, medium- and low-resource languages among them. Particular attention will be paid to the languages with fewer resources: Asturian, Aragonese and Aranese.</p>
      </abstract>
      <kwd-group>
        <kwd>Romance languages</kwd>
        <kwd>neural machine translation</kwd>
        <kwd>parallel corpora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-0">
      <title>1. Funding institution and duration</title>
      <p>The TAN-IBE project: Neural Machine Translation for the Romance Languages of the Iberian Peninsula is a research project funded by the Spanish Ministry of Science and Innovation in the call for proposals Proyectos de generación de conocimiento 2021. The project has a duration of three years and started in September 2022.</p>
    </sec>
    <sec id="sec-1">
      <title>2. Project participants</title>
      <p>The following institutions are involved in the TAN-IBE project: Universitat Oberta de Catalunya (UOC), which leads the project and is in charge of the training and evaluation of the neural systems; Universidad de Oviedo, which is mainly in charge of the compilation of the corpora for Asturian; Universidad de Zaragoza, which is mainly in charge of the compilation of the corpora for Aragonese; and Universitat de Lleida (UdL), which is mainly responsible for the compilation of the corpora for Aranese.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Motivation and background</title>
      <sec id="sec-2-1">
        <title>3.1. Romance languages of the Iberian Peninsula</title>
        <p>There is a large number of Romance languages on the Iberian Peninsula. In this project we will consider the following: Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese. This list could be extended with other languages and varieties. These languages are very disparate in terms of their official status and number of speakers. These two factors, official status and number of speakers, correlate in most cases with the available linguistic resources (for this project we are especially interested in parallel corpora) and with the number and quality of the machine translation systems available. As far as officiality is concerned, we can distinguish three levels: state officiality (officiality in an entire state of the Iberian Peninsula), autonomous or regional officiality (officiality in an autonomous community or region, or at least part of it), and international officiality (officiality in international institutions such as the European Union or the United Nations). Table 1 shows the level of officiality and the approximate number of speakers of these languages on the Iberian Peninsula.</p>
        <p>For example, Catalan is official in the state of Andorra and in several autonomous communities, and Aranese is official in the entire territory of the autonomous community of Catalonia.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Existing linguistic resources</title>
        <p>In Table 2 we can observe the approximate total number of segments in the parallel corpora available in the OPUS collection between Spanish and each of the other languages under study (Portuguese, Catalan, Galician, Asturian and Aranese).</p>
        <p>Another interesting resource for training machine translation engines is monolingual corpora, since there are techniques capable of training systems using monolingual data. For Spanish, Portuguese, Catalan and Galician, large amounts of text can easily be collected from Common Crawl, which periodically downloads web content and makes the downloaded data available. A language detection algorithm is applied to this download, so that the data for a given language can be requested. Unfortunately, no data is available for the rest of the languages under study, as the language detector used is not trained to detect them. Another possible source of monolingual corpora is Wikipedia, which has versions for all the languages in this project (with the exception of Aranese, which could experimentally use Occitan data). Table 3 shows the number of Wikipedia articles for each of the project languages.</p>
        <p>With regard to the machine translation systems available between Spanish and the other languages, we will analyze three specific systems: Apertium, a shallow syntactic transfer system distributed under a free license; Google Translate, a very popular neural machine translation system that provides numerous language pairs; and DeepL, a commercial neural system that is also well known for its quality. Table 4 shows the availability of these systems from Spanish to the rest of the languages in this study.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Availability of Spanish to other languages for three widely used machine translation systems.</p>
          </caption>
          <table>
            <thead>
              <tr><th></th><th>Apertium</th><th>GoogleT</th><th>DeepL</th></tr>
            </thead>
            <tbody>
              <tr><td>Portuguese</td><td>X</td><td>X</td><td>X</td></tr>
              <tr><td>Catalan</td><td>X</td><td>X</td><td></td></tr>
              <tr><td>Galician</td><td>X</td><td>X</td><td></td></tr>
              <tr><td>Asturian</td><td>X</td><td></td><td></td></tr>
              <tr><td>Aragonese</td><td>X</td><td></td><td></td></tr>
              <tr><td>Aranese</td><td>X</td><td></td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As can be seen from Table 4, only three languages (Portuguese, Catalan and Galician) have a neural machine translation system with Spanish as the source language. Currently, the predominant methodology and the one that achieves better quality is neural machine translation [<xref ref-type="bibr" rid="ref1">1</xref>]. Thus, most of the Romance languages under study do not have machine translation systems using this methodology. Neural machine translation systems are trained using parallel corpora of good quality and large size. The data in Table 2 are not encouraging for the languages that do not have neural machine translation systems, as there are no corpora of sufficient size for them. There is therefore an urgent need for larger parallel corpora for these languages.</p>
        <sec id="sec-2-3-2">
          <title>3.3. Training strategies for under-resourced language pairs</title>
          <p>In recent years, there has been considerable interest in the development of methodologies for training neural machine translation systems for language pairs with very few resources. Four major groups of strategies can be distinguished: neural machine translation based on transfer learning; multilingual machine translation; self-supervised machine translation; and unsupervised machine translation. During the project we intend to explore the first two strategies.</p>
          <sec id="sec-2-3-2-1">
            <title>3.3.1. MT based on transfer learning</title>
            <p>Suppose we want to train a machine translation system from language A to language C, but this language pair has very few parallel segments available. There is, however, a language B which is closely related to language C (for example, they are close languages of the same family, like the working languages of this project), and we have large parallel corpora between languages A and B. Using so-called transfer learning, we start by training a neural system from language A to language B and, once the training is finished, we continue training it using a corpus of the language pair B-C [<xref ref-type="bibr" rid="ref2">2</xref>]. In [<xref ref-type="bibr" rid="ref3">3</xref>] a modification to this methodology exploiting the vocabulary overlap between these languages is introduced. To increase the overlap in the vocabulary, they split the words into subwords using BPE (Byte Pair Encoding) [<xref ref-type="bibr" rid="ref4">4</xref>]. They then train the A-B system, transfer the parameters including the word embeddings from the source language to another model, and continue training the B-C system. In the TAN-IBE project a Spanish-Aranese system could be trained by first training a Spanish-Catalan system with a large corpus and, once trained, continuing the training with the Catalan-Aranese corpus.</p>
          </sec>
          <sec id="sec-2-3-2-2">
            <title>3.3.2. Multilingual MT</title>
            <p>Multilingual machine translation systems [<xref ref-type="bibr" rid="ref5">5</xref>] allow us to train a single neural system that shares a single attention mechanism. Imagine we are working with languages A, B, C and D. If we have a parallel corpus for some of these language combinations (e.g. A-B, A-C, A-D, B-C and B-D), we can train a machine translation system that can translate between all pairs, regardless of the fact that for some of the language pairs there is no parallel corpus available (e.g. the C-D pair in our example). This is possible because the resulting system is able to use the similarities between the languages. This configuration can be very useful to train systems for language pairs with few resources while training language pairs with more resources. In our project the Spanish-Portuguese, Spanish-Catalan and Spanish-Galician pairs would be the resource-rich pairs, while Spanish-Asturian, Spanish-Aragonese and Spanish-Aranese would be the resource-poor pairs. This same configuration could produce translation systems for pairs without any parallel corpus, such as Asturian-Aranese; this is called zero-shot translation. In [<xref ref-type="bibr" rid="ref6">6</xref>] it is shown that the quality of these zero-shot translations can be significantly improved if a few parallel segments of the C-D pair (Asturian-Aranese, in the example above) are available. In [<xref ref-type="bibr" rid="ref7">7</xref>] it is emphasized that most multilingual systems take English as the core language, since they are trained only with parallel corpora consisting of texts that have been translated from or into English. In their work they show that an improvement of up to 10 points in BLEU can be achieved by using non-English-centric models in the translation of non-English language pairs. This work is important for our project, as English is not among the languages we intend to work with, so the core language will not be English. Another aspect that has occupied the attention of researchers is the influence of typological differences between the languages involved in a multilingual system. In some studies [<xref ref-type="bibr" rid="ref8">8</xref>] backtranslation is used in multilingual systems to improve the translation quality of language pairs for which no parallel corpus is available. The technique of backtranslation [<xref ref-type="bibr" rid="ref9">9</xref>] consists of using a monolingual corpus of the target language (B) to create a parallel corpus in which the sentences in the source language (A) are obtained using a machine translation system for the B-A language pair. This new synthetic parallel corpus is added to the available real A-B parallel corpus and both are used to train the new A-B machine translation system. It is important to note that the only synthetic part of the parallel corpus obtained by backtranslation is the part corresponding to the source language (A), since the part corresponding to the target language (B) comes from real language texts.</p>
          </sec>
        </sec>
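<p>The vocabulary-overlap idea behind the transfer-learning recipe discussed in Section 3.3.1 can be illustrated with a minimal BPE learner in the style of [<xref ref-type="bibr" rid="ref4">4</xref>]. This is a toy sketch, not the project's actual tooling; real systems use implementations such as subword-nmt or SentencePiece. The example corpus is the classic English one; the same mechanism applied to mixed text from two related languages turns shared stems into shared subword units.</p>

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word list (Sennrich et al. style)."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to split a (possibly unseen) word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Learn merges on a small corpus; unseen words then reuse learned subwords.
merges = learn_bpe(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3, 10)
```

In the transfer-learning setting of [<xref ref-type="bibr" rid="ref3">3</xref>], a subword vocabulary learned jointly over the parent pair (e.g. Spanish-Catalan) and the child pair (e.g. Catalan-Aranese) increases the proportion of embeddings that transfer meaningfully to the child model.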
      </sec>
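<p>Multilingual systems of the kind described in the multilingual MT discussion above are commonly trained on a single mixed corpus in which each source sentence is prefixed with a token naming the desired target language, as in [<xref ref-type="bibr" rid="ref6">6</xref>]; at inference time the same token selects the output language, which is what makes zero-shot directions addressable. A minimal sketch of this data preparation follows; the <monospace>&lt;2xxx&gt;</monospace> token format is an illustrative assumption, not a prescribed convention.</p>

```python
def tag_multilingual_corpus(pairs):
    """Merge several language pairs into one training set.

    pairs: list of (src_lang, tgt_lang, src_sentence, tgt_sentence).
    Each source sentence is prefixed with a target-language token, so a
    single model learns all directions and can be asked, via the token,
    for a direction never seen in training (zero-shot translation).
    """
    return [(f"<2{tgt}> {src_sent}", tgt_sent)
            for _, tgt, src_sent, tgt_sent in pairs]

# Toy mixed corpus: Spanish into Catalan and Asturian.
training_set = tag_multilingual_corpus([
    ("spa", "cat", "buenos días", "bon dia"),
    ("spa", "ast", "buenos días", "bonos díes"),
])
```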
    </sec>
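<p>The backtranslation procedure of [<xref ref-type="bibr" rid="ref9">9</xref>] described above reduces, in code, to generating synthetic source sentences with an existing B-A system and concatenating them with the real corpus. In this minimal sketch the <monospace>translate_b_to_a</monospace> argument stands in for a real B-A engine, and the toy function passed to it is purely illustrative.</p>

```python
def build_backtranslated_corpus(mono_b, translate_b_to_a, real_pairs):
    """Augment a real A-B parallel corpus with synthetic pairs.

    mono_b: monolingual sentences in the target language B.
    translate_b_to_a: an existing B->A machine translation system.
    real_pairs: list of (a_sentence, b_sentence) human-translated pairs.
    Only the A side of the synthetic pairs is machine-generated; the
    B side comes from real target-language text.
    """
    synthetic_pairs = [(translate_b_to_a(b), b) for b in mono_b]
    return real_pairs + synthetic_pairs

# Toy stand-in for a B->A system (illustrative only).
def toy_b_to_a(sentence):
    return "<bt> " + sentence

corpus = build_backtranslated_corpus(
    ["sentence b1", "sentence b2"],          # monolingual B text
    toy_b_to_a,
    [("sentence a0", "sentence b0")],        # one real A-B pair
)
```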
    <sec id="sec-3">
      <title>4. Goals of the project</title>
      <p>The main objective of the project is the design, training and evaluation of neural machine translation systems between the Romance languages of the Iberian Peninsula. This main objective can be divided into the following specific objectives:</p>
      <list list-type="bullet">
        <list-item><p>To compile parallel and monolingual corpora for the languages of the project, with a special effort for Asturian, Aragonese and Aranese.</p></list-item>
        <list-item><p>To explore new techniques for training neural translation engines.</p></list-item>
        <list-item><p>To train neural translation systems between Spanish and the other languages of the project, in both directions.</p></list-item>
        <list-item><p>To train multilingual systems capable of translating to and from all the languages of the project.</p></list-item>
        <list-item><p>To evaluate all trained systems using automatic metrics and to compare them with existing machine translation systems.</p></list-item>
        <list-item><p>To perform human evaluations of the trained systems between Spanish and Asturian, Aragonese and Aranese.</p></list-item>
        <list-item><p>To create guides and scripts that facilitate the training of neural machine translation systems.</p></list-item>
        <list-item><p>To publish the results of the TAN-IBE project under free licenses.</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>5. Summary of results to date</title>
      <p>During the first months, the activity has focused on the compilation of linguistic resources for Asturian, Aragonese and Aranese. Several scripts and programs have also been developed to facilitate the task of compiling parallel corpora.</p>
      <sec id="sec-5-1">
        <title>5.1. Scripts and programs</title>
        <p>Some of the larger parallel corpora for the languages of the project contain numerous errors: many segments are not in the required languages and many others are not translation equivalents. To filter out incorrect segments we have developed a script that re-verifies the languages and applies a score based on SBERT [<xref ref-type="bibr" rid="ref10">10</xref>] to detect misaligned segments. To facilitate the alignment of parallel corpora and the search for parallel segments in comparable corpora, we have developed a set of programs that support the process using Hunalign [<xref ref-type="bibr" rid="ref11">11</xref>] and SBERT.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Corpora</title>
        <p>We have developed the FLORES-200 [<xref ref-type="bibr" rid="ref12">12</xref>] corpus for Aragonese and Aranese, and have thoroughly revised the Asturian version, because it contained errors.</p>
        <p>For the creation of the parallel Spanish-Asturian corpus we are using various sources, mainly those available on the Internet, such as legal texts, web pages and Wikipedia, as well as texts obtained through agreements with the media, publishers and institutions such as the Academia de la Llingua Asturiana, the Directorate General of Language Policy of the Principality of Asturias, and the linguistic normalization services of the city councils of Gijón and Corvera. We would also like to highlight the ESLEMA material provided by researchers from the University of Oviedo and the compilation of various literary works.</p>
        <p>The selection and preparation of the corpus for Aragonese has been conditioned by the fact that it is a minority language. Among other factors, we can highlight the lack of linguistic standardization, the absence of a reference institution regarding the proper use of the language, and the diversity of orthographic rules used by the different associations and organizations. There is abundant literature on the early years of the renaxedura de l'aragonés (the rebirth of the language, mainly in the 1980s), in which a large number of books, magazines and journals were published, and a downward trend in the corpus is observed from the second half of the 2000s until 2015. The lack of institutional recognition, internal discordance between associations and the limited presence of the language on the Internet and in the media can be pointed out as the main factors. However, the creation in 2015 of the Directorate General of Language Policy of the Government of Aragon has significantly increased the corpus by promoting the use of the language in education, literature, the Internet, the media, and university and scientific research, and by reaching a better agreement on orthographic rules and linguistic standardization. The assistance of the Directorate General for Language Policy has been fundamental, since it has provided a large corpus, largely composed of monolingual texts, but also containing texts in Spanish together with their translation into Aragonese. Most of them are translations of legal documents and laws, but there is also educational material and literature. The institution also provided a large database with the contents of the Aragonario (the reference dictionary of the Aragonese language), which contains the translation of practically all known words in Aragonese. Finally, it should be noted that the participation of three of the four most relevant publishers in the Aragonese language has been important in order to cover the rather limited corpus of literary works published in recent years.</p>
        <p>As for Aranese, the work carried out to date has involved starting the compilation from the normative documents up to the current approval and first standardization of this language, which date from the period after 1982, discarding the previous ones. For this reason, we have obtained texts in standardized Aranese from Aranese newspapers of the last thirty years. We have continued with the publications of the few existing Aranese writers, who have offered us their entire bibliography, and with monographs and online editions whose material has been provided for open use: Associació Centre d'Estudis i Documentació de la Comunicació (UAB), Edicions deth Conselh (CGA), and other small publishers with whom we have collaborated and who have provided their writings in Aranese.</p>
      </sec>
    </sec>
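<p>The corpus-cleaning approach described in Section 5.1 (language re-verification plus an SBERT-based similarity score) can be sketched as follows. The encoder is injected as a parameter so the sketch stays self-contained; in practice one would use a multilingual sentence-embedding model in the spirit of [<xref ref-type="bibr" rid="ref10">10</xref>], and the 0.6 threshold is an illustrative value, not the project's actual setting.</p>

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_misaligned(pairs, encode, threshold=0.6):
    """Keep only pairs whose sentence embeddings are similar enough.

    pairs: list of (src_sentence, tgt_sentence).
    encode: maps a sentence to a vector; a stand-in for a multilingual
    sentence embedder, under which true translations score high and
    misaligned segments score low.
    """
    return [
        (s, t) for s, t in pairs
        if cosine(encode(s), encode(t)) >= threshold
    ]

# Toy character-count "embedder" (illustrative only; a real filter would
# use a multilingual sentence-embedding model so that translations of the
# same sentence land close together in the embedding space).
def char_embed(sentence):
    return Counter(sentence.lower())

kept = filter_misaligned(
    [("global", "global"), ("aaaa", "zzzz")],
    char_embed,
)
```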
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>The project TAN-IBE: Neural Machine Translation for</title>
        <p>the Romance languages of the Iberian Peninsula is funded
by the Spanish Ministry of Science and Innovation.
Reference: PID2021-124663OB-I00 funded by MCIN /AEI
/10.13039/501100011033 / FEDER, EU.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Castilho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moorkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gaspari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sosoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgakopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lohar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Way</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Miceli-Barone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gialama</surname>
          </string-name>
          ,
          <article-title>A comparative quality evaluation of PBSMT and NMT using professional translators</article-title>
          ,
          <source>in: Proceedings of Machine Translation Summit XVI: Research Track</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <article-title>Transfer learning for low-resource neural machine translation</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1568</fpage>
          -
          <lpage>1575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <article-title>Transfer learning across low-resource, related languages for neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Eighth International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>2</volume>
          : Short Papers)
          ,
          <year>2017</year>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Neural machine translation of rare words with subword units</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1715</fpage>
          -
          <lpage>1725</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Multi-way, multilingual neural machine translation with a shared attention mechanism</article-title>
          ,
          <source>in: 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>866</fpage>
          -
          <lpage>875</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krikun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thorat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , et al.,
          <article-title>Google's multilingual neural machine translation system: Enabling zero-shot translation</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          (
          <year>2017</year>
          )
          <fpage>339</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Kishky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Celebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          , et al.,
          <article-title>Beyond English-centric multilingual machine translation</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>4839</fpage>
          -
          <lpage>4886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Titov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <article-title>Improving massively multilingual neural machine translation and zero-shot translation</article-title>
          ,
          <source>in: 2020 Annual Conference of the Association for Computational Linguistics, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1628</fpage>
          -
          <lpage>1639</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Improving neural machine translation models with monolingual data</article-title>
          ,
          <source>in: 54th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halácsy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kornai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Németh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Trón</surname>
          </string-name>
          ,
          <article-title>Parallel corpora for medium density languages</article-title>
          ,
          <source>in: Recent Advances in Natural Language Processing IV</source>
          , John Benjamins,
          <year>2007</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>522</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>