Multilingual Terminological Resources: Comparing Machine Translation and Corpus-Based Translation

Melania Cabezas-García, University of Granada

C/ Buensuceso 11, 18002 Granada, Spain

Pilar León-Araúz, University of Granada

C/ Buensuceso 11, 18002 Granada, Spain

Keywords: multiword term, machine translation, corpus, specialized translation, terminology

Terminological resources increasingly use machine translation as a way to save time and reduce costs. With a view to enhancing the multilingual representation of multiword terms (e.g. passive stall-regulated wind turbine) in terminological resources, we analyze English-Spanish multiword term translation in various machine translation systems, paying special attention to the errors encountered. We also compare machine translation output with the equivalents found in a comparable corpus. Even though machine translation often contains errors, it can serve as a basis for human post-editing, thus saving time and costs in terminological work. Comparable corpora, on the other hand, offer better results, but searches are more time-consuming.

Introduction

With a view to expanding markets and disseminating knowledge, specialized texts generate a large volume of translations. Terminological resources should assist in this respect by including multilingual information. To this end, machine translation is increasingly being used as a way to save time and reduce costs [1].

This paper focuses on the translation of distinctive units of scientific texts, i.e. multiword terms (e.g. passive stall-regulated wind turbine), which pose problems for both human translators and natural language processing systems. However, multiword term machine translation has received little attention, with some exceptions such as [2]. This is especially true of more complex multiword terms that have three or more constituents.

To enhance the multilingual representation of multiword terms in terminological resources, we carried out the following tasks: (i) we analyzed English-Spanish multiword term translation in various machine translation systems; (ii) we proposed a set of causes that may generate errors in multiword term machine translation; and (iii) we compared machine translation output with the equivalents that can be manually found in corpora. For this purpose, a set of English multiword terms with three, four, and five constituents related to environmental science was extracted from a specialized corpus in this field (10,228,919 words, [3]). Environmental science was chosen because of the large volume of translations generated by increasing environmental awareness.

Machine Translation versus Corpus-Based Translation

1st International Conference on "Multilingual digital terminology today. Design, representation formats and management systems", June 16-17, Padova, Italy
EMAIL: melaniacabezas@ugr.es (M. Cabezas-García); pleon@ugr.es (P. León-Araúz)
ORCID: 0000-0002-8622-1036 (M. Cabezas-García); 0000-0002-8520-2749 (P. León-Araúz)

Contrary to the classic view of machine translation as a threat that would replace human translators, machine translation presents opportunities not only for human translators, as evidenced by the great demand for machine translation post-editing (i.e. reviewing and improving a machine translation), but also for terminologists. Including post-editing in the workflow adds value to machine translation by minimizing possible mistakes and providing quality equivalents for inclusion in terminological resources.

Even though training a neural machine translation system on carefully selected corpora from the specialized subject field could provide better results than using generic machine translation engines, translators usually do not have user-friendly tools to train their own domain-specific engines. For this reason, the selected English multiword terms were provided without context to different generic machine translation engines: Google Translate and DeepL (neural systems), and Apertium (rule-based system).

To compare machine translations with equivalents found in corpora, parallel or comparable corpora can be used. Parallel corpora are sets of original texts aligned with their translations, thus facilitating the identification of equivalents. However, such corpora are scarce, especially in languages other than English, and generally show a marked influence of the source text on the translation. In contrast, comparable corpora are more useful. Since they are two sets of original texts of the same type and subject, they can be used to analyze native expressions in each language [4].

Therefore, a Spanish comparable corpus was used, which includes environmental texts originally written in this language (10,667,434 words). Techniques for identifying multiword term equivalents in corpora [5] were employed since translation identification in comparable corpora is not as direct as in machine translation.
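The comparison between machine translation output and corpus evidence can be illustrated with a minimal sketch. It is not the procedure described in [5]; it only counts how often each candidate equivalent is attested in a set of Spanish sentences. The function name `attestation_counts`, the two corpus sentences, and the candidate "electricidad generada por el viento" are hypothetical and serve only as toy data; "viento-electricidad generada" is one of the machine translation outputs reported in this paper.

```python
import re
from collections import Counter


def attestation_counts(candidates, corpus_sentences):
    """Count whole-phrase, case-insensitive occurrences of each candidate."""
    counts = Counter()
    lowered = [sentence.lower() for sentence in corpus_sentences]
    for candidate in candidates:
        # Require non-word characters (or string edges) around the phrase
        # so that partial-word matches are not counted.
        pattern = re.compile(r"(?<!\w)" + re.escape(candidate.lower()) + r"(?!\w)")
        counts[candidate] = sum(len(pattern.findall(s)) for s in lowered)
    return counts


# Hypothetical toy corpus (invented sentences, not taken from the authors' corpus).
corpus = [
    "La electricidad generada por el viento cubre parte de la demanda.",
    "El parque produce electricidad generada por el viento.",
]

# One invented candidate and one MT output quoted in the paper.
candidates = [
    "electricidad generada por el viento",
    "viento-electricidad generada",
]

counts = attestation_counts(candidates, corpus)
for candidate in candidates:
    print(candidate, counts[candidate])
```

A candidate with no attestations in a large comparable corpus is a signal that it may need post-editing, although absence alone does not prove that it is wrong.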

Translating Multiword Terms using Machine Translation and Corpora

Multiword terms pose problems both to human translators and natural language processing systems since their adequate translation must consider aspects such as their internal dependencies, the semantic relation between constituents, the specialization of elements, etc. [5]. Many of these issues involve human intelligence, which machine translation lacks. General multiword expressions (e.g. take a seat, by and large, let's go, as soon as) have been widely explored in machine translation [6][7][8][9][10]. However, specialized multiword terms have received considerably less attention.

Not surprisingly, our results revealed that output varies across machine translation engines. The engines often produce errors of different nature and magnitude, which we used to establish the causes behind them and which could be used to enhance machine translation systems. These errors include: (i) the wrong identification of internal dependencies ([doubly fed] [induction generator] > inducción alimentada doblemente generador, lit. *generator doubly fed induction); (ii) the wrong translation of constituents (wave turbulence interaction parameterization > interacción de turbulencia ondulatoria parameterization); and (iii) the wrong identification of the internal semantic relation (wind-generated electricity > viento-electricidad generada, lit. *generated wind-electricity). However, machine translation can serve as a basis for human post-editing, thus saving time and costs in terminological work.
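In the example given for error type (ii), the constituent "parameterization" appears to be carried over untranslated into the Spanish output. As a minimal sketch, and assuming this reading of the example, such cases could be flagged automatically before post-editing. The function name `untranslated_constituents` is hypothetical and the check is a simple heuristic, not a method used in this study.

```python
import re


def untranslated_constituents(source_term, mt_output):
    """Return source constituents that reappear unchanged in the MT output."""

    def tokenize(text):
        # Lowercase and split on non-word characters (accents are preserved).
        return re.findall(r"\w+", text.lower())

    output_tokens = set(tokenize(mt_output))
    return [token for token in tokenize(source_term) if token in output_tokens]


# Source term and MT output quoted in the paper for error type (ii).
flagged = untranslated_constituents(
    "wave turbulence interaction parameterization",
    "interacción de turbulencia ondulatoria parameterization",
)
print(flagged)
```

Words spelled identically in English and Spanish (e.g. solar) would also be flagged, so the result is only a list of candidates for a human post-editor to review.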

Comparable corpora, on the other hand, offer better results, but searches are more time-consuming. Ideally, these different techniques should be integrated into translators' and terminologists' workflows, something that language service providers in the 2020s are bound to do. Furthermore, these results can be integrated into training for future translators and terminologists, who will have to work in this ever-changing reality.

Acknowledgements

This research was carried out as part of project PID2020-118369GB-I00, Transversal integration of culture into an environmental terminological knowledge base (TRANSCULTURE), funded by the Spanish Ministry of Science and Innovation; and project A-HUM-600-UGR20, Culture as a transversal module in an environmental terminological knowledge base (CULTURAMA), funded by the ERDF Operational Programme for Andalucía 2014-2020.

References

[1] M. Arcan, E. Montiel-Ponsoda, J. P. McCrae, P. Buitelaar, Automatic Enrichment of Terminological Resources: the IATE RDF Example, in: Proceedings of LREC 2018, 2018.

[2] C. Valavani, C. Alexandris, G. K. Mikros, Improving machine translation output of German compound and multiword financial terms: a comparison with cross-linguistic data, Human-Intelligent Systems Integration 2 (2020).

[3] P. León-Araúz, A. San Martín, A. Reimerink, The EcoLexicon English Corpus as an open corpus in Sketch Engine, in: J. Čibej, V. Gorjanc, I. Kosem, S. Krek (Eds.), Proceedings of the 18th EURALEX International Congress, Euralex, Ljubljana, 2018.

[4] L. Bowker, Terminology and translation, in: H. Kockaert, F. Steurs (Eds.), Handbook of Terminology, John Benjamins, Amsterdam, Philadelphia, 2015.

[5] M. Cabezas-García, P. León-Araúz, Procedimiento para la traducción de términos poliléxicos con la ayuda de corpus, in: G. Corpas Pastor, M. R. Bautista Zambrana, C. M. Hidalgo Ternero (Eds.), Sistemas fraseológicos en contraste: Enfoques computacionales y de corpus, Comares, Granada, 2021.

[6] A. Hurskainen, Multiword Expressions and Machine Translation, Technical Reports in Language Technology, Report 1, 2008.

[7] A. Barreiro, J. Monti, B. Orliac, F. Batista, When multiwords go bad in machine translation, in: MT Summit Workshop Proceedings on Multi-word Units in Machine Translation and Translation Technology, 2013.

[8] M. Constant, G. Eryiğit, J. Monti, L. van der Plas, C. Ramisch, M. Rosner, A. Todirascu, Multiword Expression Processing: A Survey, Computational Linguistics 43(4) (2017).

[9] S. Ebrahim, D. Hegazy, M. G. M. Mostafa, S. R. El-Beltagy, Detecting and Integrating Multiword Expression into English-Arabic Statistical Machine Translation, Procedia Computer Science 117 (2017).

[10] A. Zaninello, A. Birch, Multiword Expression aware Neural Machine Translation, in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020.