<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A Coruña, Spain. john.ortega@usc.gal (J. E. Ortega); iria.dedios@usc.gal (I. de-Dios-Flores); jramon.pichel@usc.gal (J. R. Pichel); pablo.gamallo@usc.gal (P. Gamallo)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A neural machine translation system for Galician from transliterated Portuguese text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>John E. Ortega</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iria de-Dios-Flores</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Ramom Pichel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Gamallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Tecnoloxías da Información (CITIUS), Universidad de Santiago de Compostela</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We present a neural machine translation (NMT) system for translating both Spanish and English to Galician (ES–GL and EN–GL). Galician is a language closely related to Portuguese, with low to medium resources, spoken in northwestern Spain. Our NMT system is trained on large-scale synthetic ES→PT→GL and EN→PT→GL parallel corpora created by the spelling transliteration of Portuguese to Galician from high-quality Spanish to Portuguese (ES–PT) and English to Portuguese (EN–PT) translation memories. The NMT system is made available via a public web interface at https://demos.citius.usc.es/nos_tradutor.</p>
      </abstract>
      <kwd-group>
        <kwd>Galician Language</kwd>
        <kwd>Neural Machine Translation</kwd>
        <kwd>Transliteration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        Our translation strategy consists of two steps. The first step uses transliteration [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ] to create parallel Galician segments from the Portuguese segments in the aligned corpus, by making use of the transliteration tool port2gal<fn id="fn1"><label>1</label><p>https://github.com/gamallo/port2gal</p></fn>, which contains several hundred rules on characters and sequences of characters. Both the training and validation sets are transliterated, yielding a final parallel Galician corpus. Then, in the second step, the Galician (transliterated) corpus is used to train an NMT system with Spanish or English as the source language and Galician as the target language.
      </p>
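      <p>The transliteration step can be illustrated with a minimal Python sketch of rule-based spelling conversion applied to the Portuguese side of a parallel corpus. The four rules are illustrative assumptions based on well-known Portuguese–Galician spelling correspondences; they are not taken from port2gal, and the names RULES, transliterate and transliterate_corpus are hypothetical.</p>
      <preformat>
```python
import re

# Hypothetical, illustrative Portuguese-to-Galician spelling rules.
# They are NOT the port2gal rule set. Rules are applied in order, so
# longer character sequences are listed before shorter ones.
RULES = [
    (r"ções\b", "cións"),  # nações -> nacións
    (r"ção\b", "ción"),    # nação -> nación
    (r"nh", "ñ"),          # vinho -> viño
    (r"lh", "ll"),         # filho -> fillo
]
COMPILED = [(re.compile(pattern), replacement) for pattern, replacement in RULES]


def transliterate(text):
    """Apply every spelling rule, in order, to a lower-case Portuguese string."""
    for pattern, replacement in COMPILED:
        text = pattern.sub(replacement, text)
    return text


def transliterate_corpus(pairs):
    """Keep the source side and transliterate the Portuguese target side."""
    return [(source, transliterate(target_pt)) for source, target_pt in pairs]


if __name__ == "__main__":
    es_pt = [("el vino de la nación", "o vinho da nação")]
    print(transliterate_corpus(es_pt))
```
</preformat>
      <p>A real rule set would also need to handle capitalization and the many exceptions that a few hundred rules are meant to cover; the sketch only shows the ordered, character-sequence nature of the rewriting.</p>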
      <p>
        For the first transliteration step, we also tested a more complex strategy that combines the PT→GL Apertium translator [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which uses a basic bilingual dictionary to translate word by word, with the transliteration tool for those words that are not in the bilingual dictionary.
      </p>
      <p>
        The NMT system that we use for ES–GL and EN–GL translations was created using OpenNMT [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ], a generic deep learning framework for creating sequence-to-sequence models in machine translation. In particular, we trained an LSTM (long short-term memory) seq2seq model as well as a Transformer model for each language pair. For the LSTM, we used the following default neural network training parameters: two hidden layers, 500 hidden LSTM units per layer, input feeding enabled, 13 epochs, and a batch size of 64. Alternatively, we modified the default learning step parameters to 100,000 training steps and 10,000 validation steps. Traditional tokenization was performed with Linguakit [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ].
      </p>
      <p>
        The Transformer implementation, described in Garg et al. [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ], was configured with default training parameters: 6 layers for both encoding and decoding and a batch size of 4096 tokens. We also modified the learning step parameters to the same values as in the LSTM configuration. In this case, we used sub-word tokenization, performed with SentencePiece [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-corpora">
      <title>3. Corpora</title>
      <p>The main parallel sources we used to train the NMT system come from Opus<fn id="fn2"><label>2</label><p>https://opus.nlpl.eu</p></fn>. In particular, we used the ES–PT and EN–PT partitions of both Europarl<fn id="fn3"><label>3</label><p>https://opus.nlpl.eu/Europarl.php</p></fn>, with about 2 million sentences per language, and OpenSubtitles<fn id="fn4"><label>4</label><p>https://opus.nlpl.eu/OpenSubtitles.php</p></fn>, containing about 30 million sentences in ES–PT and 25 million in EN–PT. The Portuguese partition was transliterated to Galician so as to build ES–GL and EN–GL parallel corpora. In addition, we added the Spanish-Galician partition of CLUVI<fn id="fn5"><label>5</label><p>https://repositori.upf.edu/handle/10230/20051</p></fn>, containing 144 thousand sentences, to the ES–GL corpus.</p>
    </sec>
    <sec id="sec-results">
      <title>4. Test results</title>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>System</th><th>Pair</th><th>Source</th><th>Corpus size</th><th>BLEU</th></tr>
          </thead>
          <tbody>
            <tr><td>lstm</td><td>es-gl</td><td>Europarl+CLUVI</td><td>2.35M</td><td/></tr>
            <tr><td>lstm</td><td>es-gl</td><td>Europarl+CLUVI+OpenSubt(part)</td><td>5M</td><td/></tr>
            <tr><td>lstm</td><td>es-gl</td><td>Europarl+CLUVI+OpenSubt</td><td>30M</td><td/></tr>
            <tr><td>transformer</td><td>es-gl</td><td>Europarl+CLUVI</td><td>2.35M</td><td/></tr>
            <tr><td>transformer</td><td>es-gl</td><td>Europarl+CLUVI+OpenSubt</td><td>30M</td><td/></tr>
            <tr><td>lstm</td><td>en-gl</td><td>Europarl+OpenSubt</td><td>27.M</td><td/></tr>
            <tr><td>transformer</td><td>en-gl</td><td>Europarl+OpenSubt</td><td>27.M</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Table 1 shows the results of different experiments for ES–GL and EN–GL, combining the system (LSTM or Transformer) with the size of the corpus. We observe that LSTM works very well for close languages (ES–GL), but for the pair EN–GL, two distant languages, the results are slightly better with Transformer. In addition, we observe that the whole OpenSubtitles corpus hurts the performance in ES–GL. The best results in EN–GL combine Europarl with OpenSubtitles and are comparable to the state of the art [
        <xref ref-type="bibr" rid="ref16">15</xref>
        ]. Let us note that the movie and TV subtitles of OpenSubtitles are a highly valuable resource, but the quality of the resulting sentence alignments is often lower than for other parallel corpora [
        <xref ref-type="bibr" rid="ref17">16</xref>
        ]. The results in Table 1 allow us to confirm that, by using transliteration between two closely related languages like Portuguese and Galician, favorable outcomes can be achieved.
      </p>
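      <p>The experiments in Table 1 are compared by BLEU, a corpus-level n-gram overlap score between system output and reference translations. The text does not state which BLEU implementation was used, so the following Python sketch is only a simplified illustration under stated assumptions: whitespace tokenization, one reference per segment, n-grams up to length 4, and no smoothing. The function names ngrams and corpus_bleu are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (assumptions in the text above)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with no match gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty only applies when the output is shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["temos de impor o cumpremento dos regulamentos"]
    print(corpus_bleu(hyp, hyp))
```
</preformat>
      <p>Published scores are normally computed with a standard tool and a fixed tokenization, so values from this sketch should not be compared directly with reported results.</p>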
    </sec>
    <sec id="sec-3">
      <title>5. Demonstration</title>
      <p>Our demonstration is made up of a public-facing web page<fn id="fn6"><label>6</label><p>https://demos.citius.usc.es/nos_tradutor</p></fn> that provides Galician translations for both Spanish and English inputs. Users will be able to test the system via an open web interface (see Figure 1), where they can select the language pair (ES–GL or EN–GL) and the translation system (LSTM or Transformer), and then enter text and generate translations.</p>
      <p>
        In our demonstration, we plan to show where our system performs well and where it does not. As an example, the sentence translated from Spanish to Galician using the LSTM system in Table 2 is an excellent translation despite its length. Additionally, our system's translations handle syntax well and seem to be generally better than those of previous systems tested on the same domain. Nonetheless, we have found that, in terms of lexical and morphological quality, the Portuguese transliteration affects the performance, which we found to be better with rule-based MT systems such as Apertium [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>6. Future work</title>
      <p>We plan to perform further work with a human-in-the-loop to increase the quality of the system. This is outlined in a continuous improvement plan that includes translators for user functionality tests. For example, spelling and lexical issues such as acidente instead of accidente, which are formal Galician differences that need to be addressed, are the first to be solved using newly developed heuristics as part of our future contingency plan. The aim will be to create the highest-quality system in order to expand the language pairs to other languages such as Russian or Chinese.</p>
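      <p>One possible shape for such a heuristic is a lexicon-based post-editing pass over the system output. The Python sketch below is only an assumption about how this could look, not the planned implementation: the single lexicon entry comes from the example above, and the names CORRECTIONS and post_edit are hypothetical.</p>
      <preformat>
```python
import re

# Hypothetical correction lexicon: form produced by the system -> normative
# Galician form. The single entry is the example given in the text; a real
# list would be curated with translators.
CORRECTIONS = {"acidente": "accidente"}

WORD = re.compile(r"\w+")


def post_edit(sentence, corrections=CORRECTIONS):
    """Replace known non-normative word forms, keeping an initial capital."""
    def fix(match):
        word = match.group(0)
        fixed = corrections.get(word.lower())
        if fixed is None:
            return word
        return fixed.capitalize() if word[0].isupper() else fixed
    return WORD.sub(fix, sentence)


if __name__ == "__main__":
    print(post_edit("O acidente foi grave."))
```
</preformat>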
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was funded by the project “Nós:
Galician in the society and economy of artificial
intelligence”, agreement between Xunta de Galicia
and University of Santiago de Compostela, and
grant ED431G2019/04 by the Galician Ministry
of Education, University and Professional
Training, and the European Regional Development Fund
(ERDF/FEDER program).</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <table>
          <tbody>
            <tr><th>Spanish</th><td>Debemos imponer el cumplimiento de los reglamentos y velar por que se aplique el principio de que “el que contamina paga” para que se utilicen sanciones y también incentivos financieros a fin de presionar a los propietarios de los buques y las compañías petroleras y lograr que se introduzcan los procedimientos mejores.</td></tr>
            <tr><th>Galician</th><td>Temos de impor o cumpremento dos regulamentos e celar por que o principio do poluidor-pagador sexa aplicado para que sexan utilizadas sancións e tamén incentivos financeiros a fin de exercer presión sobre os proprietarios dos navíos e das compañías petrolíferas e conseguir que os procedementos mellores sexan introducidos.</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Knowles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <article-title>A comparison of machine translation paradigms for use in black-box fuzzy-match repair</article-title>
          ,
          <source>in: Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Forcada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ginestí-Rosell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nordfalk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>O'Regan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ortiz-Rojas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Pérez-Ortiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sánchez-Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ramírez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Tyers</surname>
          </string-name>
          ,
          <article-title>Apertium: a free/open-source platform for rule-based machine translation</article-title>
          ,
          <source>Machine Translation</source>
          <volume>25</volume>
          (
          <year>2011</year>
          )
          <fpage>127</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          ,
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Knowles</surname>
          </string-name>
          ,
          <article-title>Six challenges for neural machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1706.03872</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. O.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Improved zero-shot neural machine translation via ignoring spurious correlations</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>1258</fpage>
          -
          <lpage>1268</lpage>
          . URL: https://aclanthology.org/P19-1121. doi:10.18653/v1/P19-1121.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. R. P.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <article-title>Carvalho: English-Galician SMT system from Europarl English-Portuguese parallel corpus</article-title>
          ,
          <source>Procesamiento Del Lenguaje Natural</source>
          (
          <year>2009</year>
          )
          <fpage>379</fpage>
          -
          <lpage>381</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Mamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Neural machine translation with a polysynthetic low resource language</article-title>
          ,
          <source>Machine Translation</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>325</fpage>
          -
          <lpage>346</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Castro-Mamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Montoya Samame</surname>
          </string-name>
          ,
          <article-title>Overcoming resistance: The normalization of an Amazonian tribal language</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages</source>
          , Association for Computational Linguistics, Suzhou, China,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://aclanthology.org/2020.loresmt-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Pichel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alegria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <article-title>A methodology to measure the diachronic language distance between three languages based on perplexity</article-title>
          ,
          <source>Journal of Quantitative Linguistics</source>
          <volume>28</volume>
          (
          <year>2021</year>
          )
          <fpage>306</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Graehl</surname>
          </string-name>
          ,
          <article-title>Machine transliteration</article-title>
          ,
          <source>arXiv preprint cmp-lg/9704003</source>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Senellart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>OpenNMT: Open-source toolkit for neural machine translation</article-title>
          ,
          <source>in: Proceedings of ACL 2017, System Demonstrations</source>
          , Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>72</lpage>
          . URL: https://www.aclweb.org/anthology/P17-4012.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piñeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Martinez-Castaño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Pichel</surname>
          </string-name>
          ,
          <article-title>LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction</article-title>
          ,
          <source>in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>244</lpage>
          . doi:10.1109/SNAMS.2018.8554689.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Nallasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paulik</surname>
          </string-name>
          ,
          <article-title>Jointly learning to align and translate with transformer models</article-title>
          ,
          <source>CoRR abs/1909.02074</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1909.02074. arXiv:1909.02074.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kudo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <article-title>Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          ,
          <source>arXiv preprint arXiv:1808.06226</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. D. C.</given-names>
            <surname>Bayón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sánchez-Gijón</surname>
          </string-name>
          ,
          <article-title>Evaluating machine translation in a low-resource language combination: Spanish-Galician</article-title>
          ,
          <source>in: Machine Translation Summit XVII Vol. 2: Translator, Project and User Tracks</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kouylekov</surname>
          </string-name>
          ,
          <article-title>OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora</article-title>
          ,
          <source>in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          , European Language Resources Association (ELRA), Miyazaki, Japan,
          <year>2018</year>
          . URL: https://aclanthology.org/L18-1275.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>