Assessing the Use of Terminology in Phrase-Based Statistical Machine Translation for Academic Course Catalogues Translation

Randy Scansani (University of Bologna, Forlì, Italy) randy.scansani@unibo.it
Marcello Federico (Fondazione Bruno Kessler, Trento, Italy) federico@fbk.eu
Luisa Bentivogli (Fondazione Bruno Kessler, Trento, Italy) bentivo@fbk.eu

Abstract

English. In this contribution we describe an approach to evaluate the use of terminology in a phrase-based machine translation system to translate course unit descriptions from Italian into English. The genre is very prominent among those requiring translation by universities in European countries where English is not a native language. Two MT engines are trained on an in-domain bilingual corpus and a subset of the Europarl corpus, and one of them is enhanced by adding a bilingual termbase to its training data. Overall system performance is assessed through the BLEU score, whereas the f-score is used to focus the evaluation on term translation. Furthermore, a manual analysis of the terms is carried out. Results suggest that in some cases - despite the simplistic approach implemented to inject terms into the MT system - the termbase was able to bias the word choice of the engine.

Italiano. [Translated from Italian] In this work we describe a method to evaluate the use of terminology in a PBSMT system for translating course unit descriptions from Italian into English. The translation of this text genre is crucial for universities in European countries where English is not an official language. Two MT systems are trained on an in-domain corpus and a subset of the Europarl corpus. A bilingual glossary is added to one of the two systems. The overall performance of the systems is evaluated with the BLEU score, while the f-score is used for the specific evaluation of term translation. A manual analysis of the terms was also carried out. The results show that, despite the simple method used to insert the terms into the MT system, the termbase is in some cases able to influence the choice of terms in the output.

1 Introduction

Availability of course unit descriptions or course catalogues in multiple languages has started to play a key role for universities, especially after the Bologna process (European Commission et al., 2015) and the resulting growth in student mobility. These texts aim at providing students with all the relevant information regarding contents, prerequisites, learning outcomes, etc.

Since course unit descriptions have to be drafted in large quantities on a yearly basis, universities would benefit from the use of machine translation (MT). Indeed, the importance of developing MT tools in this domain is further testified by two previous projects funded by the EU Commission, i.e. TraMOOC [1] and Bologna Translation Service [2]. The former differs from the present work since it does not focus on academic courses, while the latter does not seem to have undergone substantial development after 2013 and, in addition, does not include the Italian-English language combination.

Automatically producing multilingual versions of course unit descriptions poses a number of challenges. A first major issue for MT systems is the scarcity of high-quality human-translated parallel texts of course unit descriptions. Also, descriptions feature not only terms that are typical of institutional academic communication, but also expressions that belong to specific disciplines (Ferraresi, 2017). This makes it cumbersome to choose the right resources and the most effective method to add them to the MT engine.

[1] Translation for Massive Open Online Courses, http://tramooc.eu/
[2] http://www.bologna-translation.eu
For this study, we chose to concentrate on course units belonging to the disciplinary domain of the exact sciences, since Italian degree programmes whose course units belong to this domain translate their contents into English more often than other programmes.

A phrase-based statistical machine translation (PBSMT) system was used to translate course unit descriptions from Italian into English. We trained one engine on a subset of the Europarl corpus and on a small in-domain corpus including course unit descriptions and degree programmes (see sect. 3.1) belonging to the domain of the exact sciences. Then, we enriched the training data set with a bilingual terminology database belonging to the educational domain (see sect. 3.2) and built a new engine. To assess the overall performance of the two systems, we automatically evaluated them with the BLEU score. We then focused on the evaluation of terminology translation by computing the f-score on the list of termbase entries occurring both in the system outputs and in the reference translation (see sect. 4). Finally, to gather more information on term translation, a manual analysis was carried out (see sect. 5).

2 Previous work

A number of approaches have already been developed to use in-domain resources like corpora and terminology in statistical machine translation (SMT), indirectly tackling the domain-adaptation challenge for MT. For example, the WMT 2007 shared task focused on domain adaptation in a scenario in which a small in-domain corpus is available and has to be integrated with large generic corpora (Koehn and Schroeder, 2007; Civera and Juan, 2007). Recently, the work by Štajner et al. (2016) showed that an English-Portuguese PBSMT system in the IT domain achieved the best results when trained on a large generic corpus and in-domain terminology.

For French-English in the military domain, Langlais (2002) reported improvements in the WER score after using existing terminological resources as constraints to reduce the search space. For the same language combination, Bouamor et al. (2012) used pairs of MWEs extracted from the Europarl corpus as one of the training resources, yet only observed a gain of 0.3% BLEU points (Papineni et al., 2002).

Other experiments have focused on how to insert terms in an MT system without having to stop or re-train it. These dynamic methods suit the purpose of the present paper, as they focus (also) on Italian-English. Arcan et al. (2014b) injected bilingual terms into an SMT system dynamically, observing an improvement of up to 15% BLEU points for English-Italian in the medical and IT domains. For the same domains and with the same languages (in both directions), Arcan et al. (2014a) developed an architecture to identify terminology in a source text and translate it using Wikipedia as a resource. The terms obtained were then dynamically added to the SMT system. This study resulted in an improvement of up to 13% BLEU points.

We have seen that results for the languages we are working on are encouraging, but since they are strongly influenced by several factors - i.e. the domain and the injection method - an experiment on academic institutional texts is required in order to test the influence of bilingual terminology resources on the output.

3 Experimental Setup

3.1 Corpora

A subset of 300,000 sentence pairs was extracted from the Europarl Italian-English bilingual corpus (Koehn, 2005). Limiting the number of sentence pairs of the generic corpus was necessary due to the limited computational resources available. Bilingual corpora belonging to the academic domain were then needed as development and evaluation data sets and to enhance the training data set. One course unit description corpus was available thanks to the CODE project [3]. After removing texts not belonging to the exact-science domain, we merged this corpus with two other smaller corpora made of course unit descriptions. We then extracted 3,500 sentence pairs to use as a development set.

Relying only on course unit descriptions to train our engines could have led to over-fitting of the models. Moreover, high-quality parallel course unit descriptions are often difficult to find. To overcome these two issues we added a small number of degree programme descriptions to our in-domain corpus. Finally, a fourth small corpus of course unit descriptions was built to be used as the evaluation data set. All the details regarding the sentence pairs and tokens are provided in Table 1.

Table 1: Number of sentence pairs and tokens in each of the data sets used.

Data Set               Sent. pairs   It Tokens   En Tokens
Training (Europarl)    300,000       7,848,936   8,046,827
Training (in-domain)   34,800        441,030     399,395
Development            3,500         48,671      43,919
Test                   3,465         49,066      45,595

[3] CODE is a project aimed at building corpora and tools to support the translation of course unit descriptions into English and the drafting of these texts in English as a lingua franca. http://code.sslmit.unibo.it/doku.php

3.2 Terminology

The terminology database was created by merging three different IATE (InterActive Terminology for Europe) [4] termbases for both languages and adding to them the terms extracted from the fifth volume of the Eurydice [5] glossaries. More specifically, the three IATE termbases were: Education, Teaching, and Organization of teaching.

To verify the relevance of our termbase with respect to the training data, we measured its coverage. Since the terms in the termbase are in their base form, in order to obtain a more accurate estimate we lemmatised [6] the training sets before calculating the overlap between the two resources. As can be seen in Table 2, 24.08% of the termbase entries also appear in the source side of the two training corpora, and 29.19% in the target side, meaning that the two resources complement each other well.

Table 2: Number of lemmas in the generic and in-domain training sets, termbase entries, and coverage of the termbase with respect to the training data.

                     It          En
Europarl lemmas      7,848,936   8,046,827
In-domain lemmas     441,030     399,395
Termbase entries     4,142       4,142
Europarl overlap     23.03%      29.20%
In-domain overlap    27.52%      29.33%
Total overlap        24.08%      29.19%

[4] http://iate.europa.eu/
[5] http://eacea.ec.europa.eu/education/eurydice/
[6] Lemmatisation was performed using the TreeTagger: https://goo.gl/JjHMcZ

3.3 Machine Translation System

We tested the performance of a PBSMT system trained on the resources described in sections 3.1 and 3.2. The system used to build the engines for this experiment is the open-source ModernMT (MMT) [7] (Bertoldi et al., 2017). Two engines were built in MMT:

- One engine trained on the subset of Europarl plus our in-domain corpus.
- One engine trained on the subset of Europarl plus our in-domain corpus and the terminology database.

Both engines were tuned on our development set and evaluated on the test set (see sect. 3.1).

[7] http://www.modernmt.eu/

4 Experimental results

To provide information on the overall translation quality of our PBSMT engines, we calculated the BLEU scores (Papineni et al., 2002) obtained on the test set. Table 3 shows the results for both engines, where the engine without terminology is referred to as w/o terms and the one with terminology as w/ terms.

Table 3: BLEU score for the two engines.

Engine      BLEU
w/o terms   25.92
w/ terms    26.00

Furthermore, we evaluated the systems focusing on their performance on terminology translation. For this purpose, we relied on the f-score. More in detail, for both engines we extracted the number of English termbase entries appearing in the system output and in the reference translation. Exploiting these figures, we were able to compute precision, recall and f-score. Results are reported in Table 4.
Table 4: Number of occurrences of termbase entries in the reference and in the output texts, number of terms in the reference also appearing in the outputs, precision, recall and f-score.

                  w/o terms   w/ terms
Terms in ref      1,133       1,133
Terms in output   1,061       1,083
Correct terms     633         630
Precision         0.596       0.581
Recall            0.558       0.555
F-score           0.577       0.568

The figures in Tables 3 and 4 show that adding our termbase to the training data set does not affect the output in a substantial way. While according to the BLEU score the w/ terms engine slightly outperforms the w/o terms engine, the f-score - indicating performance on term translation - is marginally higher for the w/o terms system.

Focusing on the usage of terminology, a number of observations can be made. As regards the distribution of termbase entries in the test set - which contains 3,465 sentence pairs - it is interesting to note that the number of output and reference sentences containing at least one term is fairly low, i.e. 945 (27.30%) for the reference text, 866 (24.99%) for the w/o terms output and 870 (25.10%) for the w/ terms output.

Considering the terms found in the two outputs, we observe that their number differs by only 23 units (ca. 2% of the number of terms in the outputs). Also, the number of overlapping terms is very high, i.e. 882 terms (out of 1,061 for the engine w/o terms and out of 1,083 for the engine w/ terms). As a matter of fact, the six most frequent terms in the systems' outputs are the same - course, oral, ability, lecture, technology and teacher - and cover approximately half of the total amount of extracted terms for both outputs.

We then compared the English termbase entries appearing in the target side of the test set to those appearing in the training set. Each of the 78 terms occurring at least once in the test set (corresponding to 1,133 total occurrences, as reported in Table 4) also occurs in the training set, 60 of them in its in-domain component.

However, even though our training data cover the total amount of terms present in the test data, and despite the high overlap between the terms produced by the two engines, there is still a considerable number of terms that differ. We thus cannot exclude an influence of the termbase on the word choice of the w/ terms system. For this reason, an in-depth analysis of the different terms produced by the two engines was carried out.

5 Manual Evaluation

The analysis of the sentences where the termbase entries used by the two engines differed showed that in some cases the termbase forced the system to use its target term even if a different translation - sometimes also correct - was present in the training corpora. Some examples are reported in Table 5.

For the source words prova orale (Example 1) and esame scritto (Example 2), the engine w/ terms used oral examination and written examination, while the one w/o terms used oral exam and written exam, but only the occurrences with examination are in the termbase. Moreover, Example 2 also includes the termbase word preparazione, which is translated as preparation by the engine w/ terms, while it is not translated at all by the engine w/o terms.

Another interesting example is the translation of the source word docente (Example 3), where the termbase corrected a wrong translation. The Italian term was wrongly translated as lecture by the engine w/o terms, and as teacher - which is the right translation for this text - by the engine w/ terms.

In Example 4, the Italian sentence contained the termbase entry voto finale, which was translated as final vote by the engine w/o terms and as the termbase MWE final mark by the w/ terms engine. Also in this case the termbase corrected a mistake, since vote is not the correct translation of voto in this context.

Table 5: MT output examples showing the influence of the termbase on the word choice of the w/ terms engine. Note that the correct (✓) and wrong (✗) marks refer to human assessment and not to correspondence with the reference.

Example 1
SRC:        La prova orale si svolgerà sugli argomenti del programma del corso.
REF:        The oral verification will be on the topics of the lectures.
W/O TERMS:  The oral exam will take place on the program of the course. ✓
W/ TERMS:   The oral examination will take place on the program of the course. ✓

Example 2
SRC:        La preparazione dello studente sarà valutata in un esame scritto.
REF:        Student preparation shall be evaluated by a 3 hrs written examination.
W/O TERMS:  The student will be evaluated in a written exam. ✗
W/ TERMS:   The preparation of the student will be evaluated in a written examination. ✓

Example 3
SRC:        Ogni docente titolare
REF:        Each lecturer.
W/O TERMS:  Every lecture. ✗
W/ TERMS:   Every teacher. ✓

Example 4
SRC:        In tal caso il voto finale terrà conto anche della prova orale.
REF:        In this case the final score will be based also on the oral part.
W/O TERMS:  In this case the final vote will take account the oral test. ✗
W/ TERMS:   In this case the final mark will be based also the oral test. ✓

The comparison between the two engines' outputs shows that, even though our training data covered the total amount of terms present in the test set, the termbase influenced the MT output of the engine w/ terms by biasing the weights assigned to a specific translation.

Such results have to be judged taking into account the preliminary nature of this study, which aimed at understanding the practical implications of using terminology in PBSMT and therefore exploited a simplistic approach to inject terms. As a matter of fact, we found that some of the termbase entries occurring in the reference - e.g. certification, instructor, text book, educational material - were not used in the output of the system w/ terms, and this is probably due to the limitations of our method. The terms instructor, text book and educational material did not occur in the w/o terms output either, while certification did.

To sum up, what emerges is that using terminology in PBSMT to translate course catalogues - and more specifically course unit descriptions - can influence the MT output. In our case, since the improvements were measured against the output of the w/o terms engine - which might be correct even when using terms different from those included in the termbase - the metric results were not informative enough and a manual analysis of the terms had to be carried out.
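The starting point of such a manual analysis, collecting the sentence pairs in which the two engines' term choices diverge, can be sketched as follows. This is an illustrative reconstruction, not the authors' actual tooling, and the naive whitespace-based entry matcher is an assumption made here for illustration.

```python
def entries_in(entries, sentence):
    """Return the set of termbase entries found in a sentence.

    Naive matching over lowercased, whitespace-split tokens; an
    illustrative assumption, not the procedure used in the paper.
    """
    tokens = sentence.lower().split()
    found = set()
    for entry in entries:
        etoks = entry.lower().split()
        n = len(etoks)
        if any(tokens[i:i + n] == etoks for i in range(len(tokens) - n + 1)):
            found.add(entry)
    return found

def differing_term_sentences(entries, out_without, out_with):
    """Pair up the two engines' outputs sentence by sentence and keep the
    pairs whose termbase entries differ: the candidates for manual review."""
    diffs = []
    for sent_wo, sent_w in zip(out_without, out_with):
        terms_wo = entries_in(entries, sent_wo)
        terms_w = entries_in(entries, sent_w)
        if terms_wo != terms_w:
            diffs.append((sent_wo, sent_w, terms_wo, terms_w))
    return diffs

# Toy example modelled on Example 1 of Table 5 (hypothetical data):
entries = ["oral examination", "written examination", "teacher", "final mark"]
wo = ["the oral exam will take place on the program of the course"]
w = ["the oral examination will take place on the program of the course"]
print(len(differing_term_sentences(entries, wo, w)))  # → 1
```

Each retained pair carries the two term sets alongside the sentences, so a reviewer can see at a glance which entry the termbase did or did not impose.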
6 Conclusion and further work

This paper has described a preliminary analysis aimed at assessing the use of in-domain terminology in PBSMT in the institutional academic domain, and more precisely for the translation of course unit descriptions from Italian into English. Following the results of the present experiment, and given its preliminary nature, we are planning to carry out further work in this field.

In section 4 we have seen that the institutional academic terms contained in our test data also appeared in the training data, thus limiting the impact of terminology on the output. However, course catalogues and course unit descriptions also include terms belonging to specific disciplines (see sect. 1). In our future work we are therefore planning to focus not only on academic terminology but also on disciplinary terminology, testing its impact on the output of an MT engine translating course unit descriptions.

After this first experiment on the widely-used PBSMT architecture, in future work we are also planning to exploit neural machine translation (NMT). In particular, our goal is to develop an NMT engine able to handle terminology correctly in this text domain, in order to investigate its effect on the post-editor's work. To this end, a termbase focused on the institutional academic domain, e.g. the UCL-K.U.Leuven University Terminology Database [8] or the Innsbrucker Termbank 2.0 [9], could be used to select an adequate benchmark for the development and evaluation of an MT engine with a high degree of accuracy in the translation of terms.

[8] https://goo.gl/huoevR
[9] https://goo.gl/W2GH5h

Acknowledgements

The authors would like to thank Silvia Bernardini, Marcello Soffritti and Adriano Ferraresi from the University of Bologna for their advice on terminology and institutional academic communication, and Mauro Cettolo from FBK for his help with ModernMT. The usual disclaimers apply.

References

Mihael Arcan, Marco Turchi, Sara Tonelli, and Paul Buitelaar. 2014a. Enhancing statistical machine translation with bilingual terminology in a CAT environment. In Yaser Al-Onaizan and Michel Simard, editors, Proceedings of AMTA 2014. Vancouver, BC.

Mihael Arcan, Claudio Giuliano, Marco Turchi, and Paul Buitelaar. 2014b. Identification of bilingual terms from monolingual documents for statistical machine translation. In Proceedings of the 4th International Workshop on Computational Terminology. Dublin, Ireland, pages 22–31. http://www.aclweb.org/anthology/W14-4803.

Nicola Bertoldi, Roldano Cattoni, Mauro Cettolo, Amin Farajian, Marcello Federico, Davide Caroselli, Luca Mastrostefano, Andrea Rossi, Marco Trombetti, Ulrich Germann, and David Madl. 2017. MMT: New open source MT for the translation industry. In Proceedings of the 20th Annual Conference of the European Association for Machine Translation. Prague, pages 86–91.

Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In Nicoletta Calzolari et al., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey, pages 674–679.

Jorge Civera and Alfons Juan. 2007. Domain adaptation in statistical machine translation with mixture modelling. In Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, Prague, Czech Republic, pages 177–180.

European Commission, EACEA, and Eurydice. 2015. The European Higher Education Area in 2015: Bologna Process Implementation Report. Luxembourg: Publications Office of the European Union.

Adriano Ferraresi. 2017. Terminology in European university settings. The case of course unit descriptions. In Paola Faini, editor, Terminological Approaches in the European Context. Cambridge Scholars Publishing, Newcastle upon Tyne, pages 20–40.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit. AAMT, Phuket, Thailand, pages 79–86.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, Prague, pages 224–227.

Philippe Langlais. 2002. Improving a general-purpose statistical translation engine by terminological lexicons. In COLING-02 on COMPUTERM 2002: Second International Workshop on Computational Terminology. Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1–7. https://doi.org/10.3115/1118771.1118776.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, pages 311–318. https://doi.org/10.3115/1073083.1073135.

Sanja Štajner, Andreia Querido, Nuno Rendeiro, João António Rodrigues, and António Branco. 2016. Use of domain-specific language resources in machine translation. In Nicoletta Calzolari et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France, pages 592–598.