<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TAN-IBE: Neural Machine Translation for the Romance Languages of the Iberian Peninsula</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antoni Oliver</string-name>
          <email>aoliverg@uoc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mercè Vàzquez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Coll-Florit</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergi Álvarez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Víctor Suárez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudi Aventín-Boya</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Valdés</string-name>
          <email>cris@uniovi.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mar Font</string-name>
          <email>mar.font@udl.cat</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Pardos</string-name>
          <email>apardoscalvo@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Oviedo. Campus de Humanidades "El Milán", C/ Amparo Pedregal</institution>
          ,
          <addr-line>s/n, 33011 Oviedo</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Zaragoza.</institution>
          <addr-line>Pedro Cerbuna 12 50009 Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Oberta de Catalunya (UOC). Rambla del Poblenou</institution>
          ,
          <addr-line>156 08018 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universitat de Lleida. Plaça de Víctor Siurana</institution>
          ,
          <addr-line>1, 25003 Lleida</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the project TAN-IBE: Neural Machine Translation for the Romance Languages of the Iberian Peninsula, a three-year research project. Its main objective is to conduct research on techniques for training NMT systems for these languages, as there are high-, medium- and low-resource languages among them. Particular attention will be paid to the languages with fewer resources: Asturian, Aragonese and Aranese.</p>
      </abstract>
      <kwd-group>
        <kwd>Romance languages</kwd>
        <kwd>neural machine translation</kwd>
        <kwd>parallel corpora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-0">
      <title>1. Funding institution and duration</title>
      <p>The TAN-IBE project: Neural Machine Translation for the Romance Languages of the Iberian Peninsula is a research project funded by the Spanish Ministry of Science and Innovation in the call for proposals Proyectos de generación de conocimiento 2021. The project has a duration of three years and started in September 2022.</p>
    </sec>
    <sec id="sec-1">
      <title>2. Project participants</title>
      <p>The following institutions are involved in the TAN-IBE project: Universitat Oberta de Catalunya (UOC), which leads the project and is in charge of the training and evaluation of the neural systems; Universidad de Oviedo, which is mainly in charge of the compilation of the corpora for Asturian; Universidad de Zaragoza, which is mainly in charge of the compilation of the corpora for Aragonese; and Universitat de Lleida (UdL), which is mainly responsible for the compilation of the corpora for Aranese.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Motivation and background</title>
      <sec id="sec-2-1">
        <title>3.1. Romance languages of the Iberian Peninsula</title>
        <p>There is a large number of Romance languages on the Iberian Peninsula. In this project we will consider the following: Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese. This list could be extended with other languages and varieties. These languages are very disparate in terms of their official status and number of speakers. These two factors, official status and number of speakers, correlate in most cases with the available linguistic resources (for this project we are especially interested in parallel corpora) and with the number and quality of the machine translation systems available. As far as officiality is concerned, we can distinguish three levels: state officiality (officiality in an entire state of the Iberian Peninsula), autonomous or regional officiality (officiality in an autonomous community or region, or at least part of it), and international officiality (officiality in international institutions such as the European Union or the United Nations). Table 1 shows the level of officiality and the approximate number of speakers of these languages on the Iberian Peninsula.</p>
        <p>For example, Catalan is official in the state of Andorra and in several autonomous communities, and Aranese is official in the entire territory of the autonomous community of Catalonia.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Existing linguistic resources</title>
        <p>In Table 2 we can observe the approximate total number of segments in the parallel corpora available in the OPUS collection between Spanish and each of the other languages under study (Portuguese, Catalan, Galician, Asturian and Aranese).</p>
        <p>Another interesting resource for training machine translation engines is monolingual corpora, since there are techniques capable of training systems using monolingual data. For Spanish, Portuguese, Catalan and Galician, large amounts of text can easily be collected from Common Crawl, which periodically downloads web content and makes the downloaded data available. A language detection algorithm is applied to this download, so that the data for a given language can be requested. Unfortunately, no data is available for the rest of the languages under study, as the language detector used is not trained to detect them. Another possible source of monolingual corpora is Wikipedia, which has versions for all the languages in this project (with the exception of Aranese, which could experimentally use Occitan data). Table 3 shows the number of Wikipedia articles for each of the project languages.</p>
        <p>With regard to the machine translation systems available between Spanish and the other languages, we will analyze three specific systems: Apertium, a shallow syntactic transfer system distributed under a free license; Google Translate, a very popular neural machine translation system that provides numerous language pairs; and DeepL, a commercial neural system that is also well known for its quality. Table 4 shows the availability of these systems from Spanish to the rest of the languages in this study.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Availability of Spanish to other languages for three widely used machine translation systems.</p>
          </caption>
          <table>
            <thead>
              <tr><th></th><th>Apertium</th><th>GoogleT</th><th>DeepL</th></tr>
            </thead>
            <tbody>
              <tr><td>Portuguese</td><td>X</td><td>X</td><td>X</td></tr>
              <tr><td>Catalan</td><td>X</td><td>X</td><td></td></tr>
              <tr><td>Galician</td><td>X</td><td>X</td><td></td></tr>
              <tr><td>Asturian</td><td>X</td><td></td><td></td></tr>
              <tr><td>Aragonese</td><td>X</td><td></td><td></td></tr>
              <tr><td>Aranese</td><td>X</td><td></td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As can be seen from Table 4, only three languages (Portuguese, Catalan and Galician) have a neural machine translation system with Spanish as the source language. Currently, the predominant methodology and the one that achieves better quality is neural machine translation [<xref ref-type="bibr" rid="ref1">1</xref>]. Thus, most of the Romance languages under study do not have machine translation systems using this methodology. Neural machine translation systems are trained using parallel corpora of good quality and large size. The data in Table 2 are not encouraging for the languages that do not have neural machine translation systems, as there are no corpora of sufficient size for them. There is therefore an urgent need for larger parallel corpora for these languages.</p>
        <sec id="sec-2-3-2">
          <title>3.3. Training strategies for under-resourced language pairs</title>
          <p>In recent years, there has been considerable interest in the development of methodologies for training neural machine translation systems for language pairs with very few resources. Four major groups of strategies can be distinguished: neural machine translation based on transfer learning; multilingual machine translation; self-supervised machine translation; and unsupervised machine translation. During the project we intend to explore the first two strategies.</p>
          <sec id="sec-2-3-2-1">
            <title>3.3.1. MT based on transfer learning</title>
            <p>Suppose we want to train a machine translation system from language A to language C, but this language pair has very few parallel segments available. There is, however, a language B which is closely related to language C (for example, they are close languages of the same family, like the working languages of this project), and we have large parallel corpora between languages A and B. Using so-called transfer learning, we start by training a neural system from language A to language B and, once the training is finished, we continue training it using a corpus of the language pair B-C [<xref ref-type="bibr" rid="ref2">2</xref>]. In [<xref ref-type="bibr" rid="ref3">3</xref>] a modification to this methodology exploiting the vocabulary overlap between these languages is introduced. To increase the overlap in the vocabulary, they split the words into subwords using BPE (Byte Pair Encoding) [<xref ref-type="bibr" rid="ref4">4</xref>]. They then train the A-B system, transfer the parameters including the word embeddings from the source language to another model, and continue training the B-C system. In the TAN-IBE project a Spanish-Aranese system could be trained by first training a Spanish-Catalan system with a large corpus and, once trained, continuing the training with the Catalan-Aranese corpus.</p>
          </sec>
          <sec id="sec-2-3-2-2">
            <title>3.3.2. Multilingual MT</title>
            <p>Multilingual machine translation systems [<xref ref-type="bibr" rid="ref5">5</xref>] allow us to train a single neural system that shares a single attention mechanism. Imagine we are working with languages A, B, C and D. If we have a parallel corpus for some of these language combinations (e.g. A-B, A-C, A-D, B-C and B-D), we can train a machine translation system that can translate between all pairs, regardless of the fact that for some of the language pairs there is no parallel corpus available (e.g. the C-D pair in our example). This is possible because the resulting system is able to use the similarities between the languages. This configuration can be very useful to train systems for language pairs with few resources while training language pairs with more resources. In our project the Spanish-Portuguese, Spanish-Catalan and Spanish-Galician pairs would be the resource-rich pairs, while Spanish-Asturian, Spanish-Aragonese and Spanish-Aranese would be the resource-poor pairs. This same configuration could produce translation systems for pairs without any parallel corpus, such as Asturian-Aranese; this is called zero-shot translation. In [<xref ref-type="bibr" rid="ref6">6</xref>] it is shown that the quality of these zero-shot translations can be significantly improved if a few parallel segments of the C-D pair (Asturian-Aranese, in the example above) are available. In [<xref ref-type="bibr" rid="ref7">7</xref>] it is emphasized that most multilingual systems take English as the core language, since they are trained only with parallel corpora consisting of texts that have been translated from or into English. In their work they show that an improvement of up to 10 points in BLEU can be achieved by using non-English-centric models in the translation of non-English language pairs. This work is important for our project, as English is not among the languages we intend to work with, so the core language will not be English. Another aspect that has occupied the attention of researchers is the influence of typological differences between the languages involved in a multilingual system. In some studies [<xref ref-type="bibr" rid="ref8">8</xref>] backtranslation is used in multilingual systems to improve the translation quality of language pairs for which no parallel corpus is available. The technique of backtranslation [<xref ref-type="bibr" rid="ref9">9</xref>] consists of using a monolingual corpus of the target language (B) to create a parallel corpus in which the sentences in the source language (A) are obtained using a machine translation system for the B-A language pair. This new synthetic parallel corpus is added to the available real A-B parallel corpus and both are used to train the new A-B machine translation system. It is important to note that the only synthetic part of the parallel corpus obtained by backtranslation is the part corresponding to the source language (A), since the part corresponding to the target language (B) comes from real language texts.</p>
          </sec>
        </sec>
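<p>The vocabulary-overlap idea behind the transfer-learning recipe discussed in Section 3.3.1 can be illustrated with a minimal BPE learner in the style of [<xref ref-type="bibr" rid="ref4">4</xref>]. This is a toy sketch, not the project's actual tooling; real systems use implementations such as subword-nmt or SentencePiece. The example corpus is the classic English one; the same mechanism applied to mixed text from two related languages turns shared stems into shared subword units.</p>

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word list (Sennrich et al. style)."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to split a (possibly unseen) word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Learn merges on a small corpus; unseen words then reuse learned subwords.
merges = learn_bpe(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3, 10)
```

In the transfer-learning setting of [<xref ref-type="bibr" rid="ref3">3</xref>], a subword vocabulary learned jointly over the parent pair (e.g. Spanish-Catalan) and the child pair (e.g. Catalan-Aranese) increases the proportion of embeddings that transfer meaningfully to the child model.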
      </sec>
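<p>Multilingual systems of the kind described in the multilingual MT discussion above are commonly trained on a single mixed corpus in which each source sentence is prefixed with a token naming the desired target language, as in [<xref ref-type="bibr" rid="ref6">6</xref>]; at inference time the same token selects the output language, which is what makes zero-shot directions addressable. A minimal sketch of this data preparation follows; the <monospace>&lt;2xxx&gt;</monospace> token format is an illustrative assumption, not a prescribed convention.</p>

```python
def tag_multilingual_corpus(pairs):
    """Merge several language pairs into one training set.

    pairs: list of (src_lang, tgt_lang, src_sentence, tgt_sentence).
    Each source sentence is prefixed with a target-language token, so a
    single model learns all directions and can be asked, via the token,
    for a direction never seen in training (zero-shot translation).
    """
    return [(f"<2{tgt}> {src_sent}", tgt_sent)
            for _, tgt, src_sent, tgt_sent in pairs]

# Toy mixed corpus: Spanish into Catalan and Asturian.
training_set = tag_multilingual_corpus([
    ("spa", "cat", "buenos días", "bon dia"),
    ("spa", "ast", "buenos días", "bonos díes"),
])
```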
    </sec>
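<p>The backtranslation procedure of [<xref ref-type="bibr" rid="ref9">9</xref>] described above reduces, in code, to generating synthetic source sentences with an existing B-A system and concatenating them with the real corpus. In this minimal sketch the <monospace>translate_b_to_a</monospace> argument stands in for a real B-A engine, and the toy function passed to it is purely illustrative.</p>

```python
def build_backtranslated_corpus(mono_b, translate_b_to_a, real_pairs):
    """Augment a real A-B parallel corpus with synthetic pairs.

    mono_b: monolingual sentences in the target language B.
    translate_b_to_a: an existing B->A machine translation system.
    real_pairs: list of (a_sentence, b_sentence) human-translated pairs.
    Only the A side of the synthetic pairs is machine-generated; the
    B side comes from real target-language text.
    """
    synthetic_pairs = [(translate_b_to_a(b), b) for b in mono_b]
    return real_pairs + synthetic_pairs

# Toy stand-in for a B->A system (illustrative only).
def toy_b_to_a(sentence):
    return "<bt> " + sentence

corpus = build_backtranslated_corpus(
    ["sentence b1", "sentence b2"],          # monolingual B text
    toy_b_to_a,
    [("sentence a0", "sentence b0")],        # one real A-B pair
)
```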
    <sec id="sec-3">
      <title>4. Goals of the project</title>
      <p>The main objective of the project is the design, training and evaluation of neural machine translation systems between the Romance languages of the Iberian Peninsula. This main objective can be divided into the following specific objectives:</p>
      <list list-type="bullet">
        <list-item><p>To compile parallel and monolingual corpora for the languages of the project, with a special effort for Asturian, Aragonese and Aranese.</p></list-item>
        <list-item><p>To explore new techniques for training neural translation engines.</p></list-item>
        <list-item><p>To train neural translation systems between Spanish and the other languages of the project, in both directions.</p></list-item>
        <list-item><p>To train multilingual systems capable of translating to and from all the languages of the project.</p></list-item>
        <list-item><p>To evaluate all trained systems using automatic metrics and to compare them with existing machine translation systems.</p></list-item>
        <list-item><p>To perform human evaluations of the trained systems between Spanish and Asturian, Aragonese and Aranese.</p></list-item>
        <list-item><p>To create guides and scripts that facilitate the training of neural machine translation systems.</p></list-item>
        <list-item><p>To publish the results of the TAN-IBE project under free licenses.</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>5. Summary of results to date</title>
      <p>During the first months, the activity has focused on the compilation of linguistic resources for Asturian, Aragonese and Aranese. Several scripts and programs have also been developed to facilitate the task of compiling parallel corpora.</p>
      <sec id="sec-5-1">
        <title>5.1. Scripts and programs</title>
        <p>Some of the larger parallel corpora for the languages of the project contain numerous errors: many segments are not in the required languages and many others are not translation equivalents. To filter out incorrect segments we have developed a script that re-verifies the languages and applies a score based on SBERT [<xref ref-type="bibr" rid="ref10">10</xref>] to detect misaligned segments. To facilitate the alignment of parallel corpora and the search for parallel segments in comparable corpora, we have developed a set of programs that support the process using Hunalign [<xref ref-type="bibr" rid="ref11">11</xref>] and SBERT.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Corpora</title>
        <p>We have developed the FLORES-200 [<xref ref-type="bibr" rid="ref12">12</xref>] corpus for Aragonese and Aranese, and have thoroughly revised the Asturian version, because it contained errors.</p>
        <p>For the creation of the parallel Spanish-Asturian corpus we are using various sources, mainly those available on the Internet, such as legal texts, web pages and Wikipedia, as well as texts obtained through agreements with the media, publishers and institutions such as the Academia de la Llingua Asturiana, the Directorate General of Language Policy of the Principality of Asturias, and the linguistic normalization services of the city councils of Gijón and Corvera. We would also like to highlight the ESLEMA material provided by researchers from the University of Oviedo and the compilation of various literary works.</p>
        <p>The selection and preparation of the corpus for Aragonese has been conditioned by the fact that it is a minority language. Among other factors, we can highlight the lack of linguistic standardization, the absence of a reference institution regarding the proper use of the language, and the diversity of orthographic rules used by the different associations and organizations. There is abundant literature on the early years of the renaxedura de l'aragonés (the rebirth of the language, mainly in the 1980s), in which a large number of books, magazines and journals were published, and a downward trend in the corpus is observed from the second half of the 2000s until 2015. The lack of institutional recognition, internal discordance between associations and the limited presence of the language on the Internet and in the media can be pointed out as the main factors. However, the creation in 2015 of the Directorate General of Language Policy of the Government of Aragon has significantly increased the corpus by promoting the use of the language in education, literature, the Internet, the media, and university and scientific research, and by reaching a better agreement on orthographic rules and linguistic standardization. The assistance of the Directorate General for Language Policy has been fundamental, since it has provided a large corpus, largely composed of monolingual texts, but also containing texts in Spanish together with their translation into Aragonese. Most of them are translations of legal documents and laws, but there is also educational material and literature. The institution also provided a large database with the contents of the Aragonario (the reference dictionary of the Aragonese language), which contains the translation of practically all known words in Aragonese. Finally, it should be noted that the participation of three of the four most relevant publishers in the Aragonese language has been important in order to cover the rather limited corpus of literary works published in recent years.</p>
        <p>As for Aranese, the work carried out to date has involved starting the compilation from the normative documents up to the current approval and first standardization of this language, which date from the period after 1982, discarding the previous ones. For this reason, we have obtained texts in standardized Aranese from Aranese newspapers of the last thirty years. We have continued with the publications of the few existing Aranese writers, who have offered us their entire bibliography, and with monographs and online editions whose material has been provided for open use: Associació Centre d'Estudis i Documentació de la Comunicació (UAB), Edicions deth Conselh (CGA), and other small publishers with whom we have collaborated and who have provided their writings in Aranese.</p>
      </sec>
    </sec>
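<p>The corpus-cleaning approach described in Section 5.1 (language re-verification plus an SBERT-based similarity score) can be sketched as follows. The encoder is injected as a parameter so the sketch stays self-contained; in practice one would use a multilingual sentence-embedding model in the spirit of [<xref ref-type="bibr" rid="ref10">10</xref>], and the 0.6 threshold is an illustrative value, not the project's actual setting.</p>

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_misaligned(pairs, encode, threshold=0.6):
    """Keep only pairs whose sentence embeddings are similar enough.

    pairs: list of (src_sentence, tgt_sentence).
    encode: maps a sentence to a vector; a stand-in for a multilingual
    sentence embedder, under which true translations score high and
    misaligned segments score low.
    """
    return [
        (s, t) for s, t in pairs
        if cosine(encode(s), encode(t)) >= threshold
    ]

# Toy character-count "embedder" (illustrative only; a real filter would
# use a multilingual sentence-embedding model so that translations of the
# same sentence land close together in the embedding space).
def char_embed(sentence):
    return Counter(sentence.lower())

kept = filter_misaligned(
    [("global", "global"), ("aaaa", "zzzz")],
    char_embed,
)
```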
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>The project TAN-IBE: Neural Machine Translation for</title>
        <p>the Romance languages of the Iberian Peninsula is funded
by the Spanish Ministry of Science and Innovation.
Reference: PID2021-124663OB-I00 funded by MCIN /AEI
/10.13039/501100011033 / FEDER, EU.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Castilho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moorkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gaspari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sosoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgakopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lohar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Way</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Miceli-Barone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gialama</surname>
          </string-name>
          ,
          <article-title>A comparative quality evaluation of PBSMT and NMT using professional translators</article-title>
          ,
          <source>in: Proceedings of Machine Translation Summit XVI: Research Track</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <article-title>Transfer learning for low-resource neural machine translation</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1568</fpage>
          -
          <lpage>1575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <article-title>Transfer learning across low-resource, related languages for neural machine translation</article-title>
          ,
          <source>in: Proceedings of the Eighth International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>2</volume>
          : Short Papers)
          ,
          <year>2017</year>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Neural machine translation of rare words with subword units</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1715</fpage>
          -
          <lpage>1725</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Multi-way, multilingual neural machine translation with a shared attention mechanism</article-title>
          ,
          <source>in: 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>866</fpage>
          -
          <lpage>875</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krikun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thorat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , et al.,
          <article-title>Google's multilingual neural machine translation system: Enabling zero-shot translation</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          (
          <year>2017</year>
          )
          <fpage>339</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Kishky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Celebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          , et al.,
          <article-title>Beyond English-centric multilingual machine translation</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>4839</fpage>
          -
          <lpage>4886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Titov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <article-title>Improving massively multilingual neural machine translation and zero-shot translation</article-title>
          ,
          <source>in: 2020 Annual Conference of the Association for Computational Linguistics, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1628</fpage>
          -
          <lpage>1639</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Improving neural machine translation models with monolingual data</article-title>
          ,
          <source>in: 54th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halácsy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kornai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Németh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Trón</surname>
          </string-name>
          ,
          <article-title>Parallel corpora for medium density languages</article-title>
          ,
          <source>in: Recent Advances in Natural Language Processing IV</source>
          , John Benjamins,
          <year>2007</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>522</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>