<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Terminological Resources: Comparing Machine Translation and Corpus-Based Translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Melania Cabezas-García</string-name>
          <email>melaniacabezas@ugr.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pilar León-Araúz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Granada</institution>
          ,
          <addr-line>C/ Buensuceso 11, Granada, 18002</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Terminological resources increasingly use machine translation as a way to save time and reduce costs. With a view to enhancing the multilingual representation of multiword terms (e.g. passive stall-regulated wind turbine) in terminological resources, we describe an analysis of English-Spanish multiword term translation in various machine translation systems, paying special attention to the errors encountered. A comparison of machine translation output with the equivalents found in a comparable corpus is also presented. Even though machine translation output often contains errors, it can serve as a basis for human post-editing, thus saving time and costs in terminological work. Comparable corpora, on the other hand, offer better results, but searches are more time-consuming.</p>
      </abstract>
      <kwd-group>
        <kwd>Multiword term</kwd>
        <kwd>machine translation</kwd>
        <kwd>corpus</kwd>
        <kwd>specialized translation</kwd>
        <kwd>terminology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The need to expand markets and disseminate knowledge means that specialized texts generate a large
volume of translations. Terminological resources should support this work by including
multilingual information. To this end, machine translation is increasingly being used as a way
to save time and reduce costs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        This paper focuses on the translation of units characteristic of scientific texts, namely multiword terms (e.g.
passive stall-regulated wind turbine), which pose problems for both human translators and natural
language processing systems. However, the machine translation of multiword terms has received little
attention, with some exceptions, such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This is especially true of more complex multiword terms
that have three or more constituents.
      </p>
      <p>
        In order to enhance the multilingual representation of multiword terms in terminological resources,
we carried out the following tasks: (i) we analyzed English-Spanish multiword term translation in
various machine translation systems; (ii) we proposed a set of causes that may underlie errors in
multiword term machine translation; and (iii) we compared machine translation output with the equivalents
that can be manually found in corpora. For this purpose, a set of English multiword terms with three,
four, and five constituents related to environmental science was extracted from a specialized corpus in this field
(10,228,919 words, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Environmental science was chosen due to the large volume of translations
generated as a result of the increasing environmental awareness.
      </p>
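      <p>As a minimal illustration of the selection step described above, the following Python sketch filters candidate terms by number of constituents. The function names and the decision to split on both whitespace and hyphens are assumptions made for this example, not the extraction procedure actually applied to the corpus. The sample terms are those cited in this paper, plus the shorter wind turbine as a contrast.</p>
      <preformat>
```python
import re


def constituent_count(term: str) -> int:
    """Count constituents, splitting on whitespace and hyphens (an assumption)."""
    return len([tok for tok in re.split(r"[\s\-]+", term.strip()) if tok])


def select_terms(candidates, min_len=3, max_len=5):
    """Keep only the terms whose constituent count lies in [min_len, max_len]."""
    allowed = range(min_len, max_len + 1)
    return [term for term in candidates if constituent_count(term) in allowed]


candidates = [
    "wind turbine",
    "wind-generated electricity",
    "doubly fed induction generator",
    "wave turbulence interaction parameterization",
    "passive stall-regulated wind turbine",
]

for term in select_terms(candidates):
    print(constituent_count(term), term)
```
      </preformat>
      <p>Whether a hyphenated modifier such as stall-regulated counts as one constituent or two changes the totals, so the splitting rule should be fixed before terms are selected.</p>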
    </sec>
    <sec id="sec-2">
      <title>2. Machine Translation versus Corpus-Based Translation</title>
      <p>The classic view casts machine translation as a threat that would replace
human translators. In practice, it also presents opportunities, not only for human translators, as
evidenced by the great demand for machine translation post-editing (i.e. reviewing and improving a
machine translation), but also for terminologists. Including post-editing in the workflow
adds value to machine translation by minimizing possible mistakes and providing quality
equivalents for inclusion in terminological resources.</p>
      <p>Training a neural machine translation system on carefully selected corpora
from the specialized subject field could provide better results than using generic machine translation
engines. However, translators usually do not have user-friendly tools to train their own
domain-specific engines. For this reason, the selected English multiword terms were submitted without context
to different generic machine translation engines: Google Translate and DeepL (neural systems), and
Apertium (rule-based system).</p>
      <p>
        To compare machine translations with equivalents found in corpora, parallel or comparable corpora
can be used. Parallel corpora are sets of original texts aligned with their translations, thus facilitating
the identification of equivalents. However, such corpora are scarce, especially in languages other than
English, and generally show a marked influence of the source text on the translation. In contrast,
comparable corpora are more useful: since they consist of two sets of original texts of the same type and
subject, they can be used to analyze native expressions in each language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Therefore, a Spanish comparable corpus was used, which includes environmental texts originally
written in this language (10,667,434 words). Techniques for identifying multiword term equivalents in
corpora [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] were employed since translation identification in comparable corpora is not as direct as in
machine translation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Translating Multiword Terms using Machine Translation and Corpora</title>
      <p>
        Multiword terms pose problems for both human translators and natural language processing systems
since their adequate translation must consider aspects such as their internal dependencies, the semantic
relation between constituents, the specialization of elements, etc. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Many of these issues involve
human intelligence, which machine translation lacks. General multiword expressions (e.g. take a seat,
by and large, let's go, as soon as) have been widely explored in machine translation [
        <xref ref-type="bibr" rid="ref10 ref6 ref7 ref8 ref9">6-10</xref>
        ]. However,
specialized multiword terms have received considerably less attention.
      </p>
      <p>Not surprisingly, our results revealed that output varies across machine translation
engines. The output often contains errors of different nature and magnitude, which we used to identify the
causes behind them and which could be used to enhance machine translation systems. These errors
include: (i) the wrong identification of internal dependencies ([doubly fed] [induction generator] &gt;
inducción alimentada doblemente generador, lit. *generator doubly fed induction); (ii) the wrong
translation of constituents (wave turbulence interaction parameterization &gt; interacción de turbulencia
ondulatoria parameterization); and (iii) the wrong identification of the internal semantic relation
(wind-generated electricity &gt; viento-electricidad generada, lit. *generated wind-electricity). Despite these errors,
machine translation can serve as a basis for human post-editing, thus saving time and costs in
terminological work.</p>
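      <p>One symptom of error type (ii), a constituent left untranslated, lends itself to a simple automatic check that could help post-editors prioritize their work. The Python sketch below flags source constituents that reappear verbatim in the machine translation output. It is only an illustrative heuristic under stated assumptions: the function names are hypothetical, tokenization is a plain split on whitespace and hyphens, and the check is not part of the analysis reported here. The two input pairs are examples quoted above.</p>
      <preformat>
```python
import re


def tokenize(term: str) -> list[str]:
    """Lowercase a term and split it on whitespace and hyphens."""
    return [tok for tok in re.split(r"[\s\-]+", term.lower()) if tok]


def untranslated_constituents(source_term: str, mt_output: str) -> list[str]:
    """Return the source constituents that reappear verbatim in the MT output."""
    output_tokens = set(tokenize(mt_output))
    return [tok for tok in tokenize(source_term) if tok in output_tokens]


# Source term paired with the machine translation output quoted in the text.
examples = {
    "wave turbulence interaction parameterization":
        "interacción de turbulencia ondulatoria parameterization",
    "doubly fed induction generator":
        "inducción alimentada doblemente generador",
}

for source_term, mt_output in examples.items():
    flagged = untranslated_constituents(source_term, mt_output)
    print(source_term, "->", flagged)
```
      </preformat>
      <p>The heuristic flags parameterization in the first pair and nothing in the second, whose error lies in the internal dependencies and not in untranslated material. Words that are spelled identically in both languages would be flagged incorrectly, so any hit still requires human review.</p>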
      <p>Comparable corpora, on the other hand, offer better results, but searches are more time-consuming.
Ideally, these different techniques should be integrated into translators’ and terminologists’ workflows,
something that language service providers in the 2020s are bound to do. Furthermore, these results can
be integrated into training for future translators and terminologists, who will have to work in this
ever-changing reality.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgements</title>
      <p>This research was carried out as part of project PID2020-118369GB-I00, Transversal integration
of culture into an environmental terminological knowledge base (TRANSCULTURE), funded by the
Spanish Ministry of Science and Innovation, and project A-HUM-600-UGR20, Culture as a transversal
module in an environmental terminological knowledge base (CULTURAMA), funded by the ERDF
Operational Programme for Andalucía 2014-2020.</p>
    </sec>
    <sec id="sec-5">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Montiel-Ponsoda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          .
          <article-title>Automatic Enrichment of Terminological Resources: the IATE RDF Example</article-title>
          ,
          <source>Proceedings of LREC 2018</source>
          ,
          <fpage>930</fpage>
          -
          <lpage>937</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Valavani</surname>
            ,
            <given-names>Christina</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christina</given-names>
            <surname>Alexandris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>George K.</given-names>
            <surname>Mikros</surname>
          </string-name>
          . “
          <article-title>Improving machine translation output of German compound and multiword financial terms: a comparison with cross-linguistic data</article-title>
          .”
          <source>Human-Intelligent Systems Integration</source>
          <volume>2</volume>
          (
          <year>2020</year>
          ):
          <fpage>29</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>León-Araúz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>San Martín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reimerink</surname>
          </string-name>
          .
          <article-title>The EcoLexicon English Corpus as an open corpus in Sketch Engine</article-title>
          ,
          <source>Proceedings of the 18th EURALEX International Congress</source>
          , edited by
          <string-name>
            <surname>Čibej</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gorjanc</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosem</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <fpage>893</fpage>
          -
          <lpage>901</lpage>
          , Ljubljana, Euralex,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bowker</surname>
          </string-name>
          .
          <article-title>Terminology and translation</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Kockaert</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Steurs</surname>
          </string-name>
          (Eds.),
          <source>Handbook of Terminology</source>
          , John Benjamins, Amsterdam, Philadelphia,
          <year>2015</year>
          , pp.
          <fpage>304</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cabezas-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>León-Araúz</surname>
          </string-name>
          .
          <article-title>Procedimiento para la traducción de términos poliléxicos con la ayuda de corpus</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Corpas Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Bautista Zambrana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Hidalgo Ternero</surname>
          </string-name>
          (Eds.),
          <source>Sistemas fraseológicos en contraste: Enfoques computacionales y de corpus</source>
          , Comares, Granada,
          <year>2021</year>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hurskainen</surname>
            ,
            <given-names>Arvi</given-names>
          </string-name>
          . “
          <article-title>Multiword Expressions and Machine Translation</article-title>
          .”
          <source>Technical Reports in Language Technology, Report No 1</source>
          (
          <year>2008</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Orliac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Batista</surname>
          </string-name>
          .
          <article-title>When multiwords go bad in machine translation</article-title>
          ,
          <source>MT Summit Workshop Proceedings on Multi-word Units in Machine Translation and Translation Technology</source>
          ,
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
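          <string-name>
            <surname>Constant</surname>
            ,
            <given-names>Mathieu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Gülşen</given-names>
            <surname>Eryiğit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Johanna</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lonneke</given-names>
            <surname>van der Plas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Ramisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Rosner</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amalia</given-names>
            <surname>Todirascu</surname>
          </string-name>
          . “
          <article-title>Multiword Expression Processing: A Survey</article-title>
          .”
          <source>Computational Linguistics</source>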
          <volume>43</volume>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ):
          <fpage>837</fpage>
          -
          <lpage>892</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ebrahim</surname>
            ,
            <given-names>Sara</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Doaa</given-names>
            <surname>Hegazy</surname>
          </string-name>
          ,
          <string-name>Mostafa Gadal-Haqq M. Mostafa</string-name>
          and
          <string-name>
            <given-names>Samhaa R.</given-names>
            <surname>El-Beltagy</surname>
          </string-name>
          . “
          <article-title>Detecting and Integrating Multiword Expression into English-Arabic Statistical Machine Translation</article-title>
          .”
          <source>Procedia Computer Science</source>
          <volume>117</volume>
          (
          <year>2017</year>
          ):
          <fpage>111</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaninello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          .
          <article-title>Multiword Expression aware Neural Machine Translation</article-title>
          ,
          <source>Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)</source>
          ,
          <fpage>3816</fpage>
          -
          <lpage>3825</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>