Assessing the Use of Terminology in Phrase-Based Statistical Machine Translation for Academic Course Catalogues Translation

Randy Scansani (University of Bologna, Forlì, Italy) randy.scansani@unibo.it
Marcello Federico (Fondazione Bruno Kessler, Trento, Italy) federico@fbk.eu
Luisa Bentivogli (Fondazione Bruno Kessler, Trento, Italy) bentivo@fbk.eu

Abstract

English. In this contribution we describe an approach to evaluate the use of terminology in a phrase-based machine translation system to translate course unit descriptions from Italian into English. The genre is very prominent among those requiring translation by universities in European countries where English is not a native language. Two MT engines are trained on an in-domain bilingual corpus and a subset of the Europarl corpus, and one of them is enhanced by adding a bilingual termbase to its training data. Overall system performance is assessed through the BLEU score, whereas the f-score is used to focus the evaluation on term translation. Furthermore, a manual analysis of the terms is carried out. Results suggest that in some cases - despite the simplistic approach implemented to inject terms into the MT system - the termbase was able to bias the word choice of the engine.

Italiano. [Translated from Italian] In this work we describe a method to evaluate the use of terminology in a PBSMT system for translating course unit descriptions from Italian into English. The translation of this text genre is crucial for universities in European countries where English is not an official language. Two MT systems are trained on an in-domain corpus and a subset of the Europarl corpus. A bilingual glossary is added to one of the two systems. The overall performance of the systems is evaluated with the BLEU score, while the f-score is used for the specific evaluation of term translation. A manual analysis of the terms was also carried out. The results show that, despite the simple method used to insert the terms into the MT system, the termbase is in some cases able to influence the choice of terms in the output.

1 Introduction

Availability of course unit descriptions or course catalogues in multiple languages has started to play a key role for universities, especially after the Bologna process (European Commission et al., 2015) and the resulting growth in student mobility. These texts aim at providing students with all the relevant information regarding contents, prerequisites, learning outcomes, etc.

Since course unit descriptions have to be drafted in large quantities on a yearly basis, universities would benefit from the use of machine translation (MT). Indeed, the importance of developing MT tools in this domain is further testified by two previous projects funded by the EU Commission, i.e. TraMOOC [1] and Bologna Translation Service [2]. The former differs from the present work since it does not focus on academic courses, while the latter does not seem to have undergone substantial development after 2013 and, in addition, does not include the Italian-English language combination.

Automatically producing multilingual versions of course unit descriptions poses a number of challenges. A first major issue for MT systems is the scarcity of high-quality human-translated parallel texts of course unit descriptions. Also, descriptions feature not only terms that are typical of institutional academic communication, but also expressions that belong to specific disciplines (Ferraresi, 2017). This makes it cumbersome to choose the right resources and the most effective method to add them to the MT engine.

[1] Translation for Massive Open Online Courses, http://tramooc.eu/
[2] http://www.bologna-translation.eu
For this study, we chose to concentrate on course units belonging to the disciplinary domain of the exact sciences, since Italian degree programmes whose course units belong to this domain translate their contents into English more often than other programmes.

A phrase-based statistical machine translation (PBSMT) system was used to translate course unit descriptions from Italian into English. We trained one engine on a subset of the Europarl corpus and on a small in-domain corpus including course unit descriptions and degree programmes (see sect. 3.1) belonging to the domain of the exact sciences. Then, we enriched the training data set with a bilingual terminology database belonging to the educational domain (see sect. 3.2) and built a new engine. To assess the overall performance of the two systems, we automatically evaluated them with the BLEU score. We then focused on the evaluation of terminology translation by computing the f-score on the list of termbase entries occurring both in the system outputs and in the reference translation (see sect. 4). Finally, to gather more information on term translation, a manual analysis was carried out (see sect. 5).

2 Previous work

A number of approaches have already been developed to use in-domain resources like corpora and terminology in statistical machine translation (SMT), indirectly tackling the domain-adaptation challenge for MT. For example, the WMT 2007 shared task focused on domain adaptation in a scenario in which a small in-domain corpus is available and has to be integrated with large generic corpora (Koehn and Schroeder, 2007; Civera and Juan, 2007). Recently, the work by Štajner et al. (2016) showed that an English-Portuguese PBSMT system in the IT domain achieved the best results when trained on a large generic corpus and in-domain terminology.

For French-English in the military domain, Langlais (2002) reported improvements in the WER score after using existing terminological resources as constraints to reduce the search space. For the same language combination, Bouamor et al. (2012) used pairs of MWEs extracted from the Europarl corpus as one of the training resources, yet only observed a gain of 0.3% BLEU points (Papineni et al., 2002).

Other experiments have focused on how to insert terms in an MT system without having to stop or re-train it. These dynamic methods suit the purpose of the present paper, as they focus (also) on Italian-English. Arcan et al. (2014b) injected bilingual terms into an SMT system dynamically, observing an improvement of up to 15% BLEU points for English-Italian in the medical and IT domains. For the same domains and with the same languages (in both directions), Arcan et al. (2014a) developed an architecture to identify terminology in a source text and translate it using Wikipedia as a resource. The terms obtained were then dynamically added to the SMT system. This study resulted in an improvement of up to 13% BLEU points.

We have seen that results for the languages we are working on are encouraging, but since they are strongly influenced by several factors - i.e. the domain and the injection method - an experiment on academic institutional texts is required in order to test the influence of bilingual terminology resources on the output.

3 Experimental Setup

3.1 Corpora

A subset of 300,000 sentence pairs was extracted from the Europarl Italian-English bilingual corpus (Koehn, 2005). Limiting the number of sentence pairs of the generic corpus was necessary due to the limited computational resources available. Bilingual corpora belonging to the academic domain were then needed as development and evaluation data sets and to enhance the training data set. One course unit description corpus was available thanks to the CODE project [3]. After removing texts not belonging to the exact-science domain, we merged this corpus with two other smaller corpora made of course unit descriptions. We then extracted 3,500 sentence pairs to use as a development set.

Relying only on course unit descriptions to train our engines could have led to over-fitting of the models. Moreover, high-quality parallel course unit descriptions are often difficult to find. To overcome these two issues we added a small number of degree programme descriptions to our in-domain corpus. Finally, a fourth small corpus of course unit descriptions was built to be used as the evaluation data set. All the details regarding the sentence pairs and tokens are provided in Table 1.

Table 1: Number of sentence pairs and tokens in each of the data sets used.

Data Set               Sent. pairs   It Tokens   En Tokens
Training (Europarl)    300,000       7,848,936   8,046,827
Training (in-domain)   34,800        441,030     399,395
Development            3,500         48,671      43,919
Test                   3,465         49,066      45,595

[3] CODE is a project aimed at building corpora and tools to support the translation of course unit descriptions into English and the drafting of these texts in English as a lingua franca. http://code.sslmit.unibo.it/doku.php

3.2 Terminology

The terminology database was created by merging three different IATE (InterActive Terminology for Europe) [4] termbases for both languages and adding to them the terms extracted from the fifth volume of the Eurydice [5] glossaries. More specifically, the three IATE termbases were: Education, Teaching, and Organization of teaching.

To verify the relevance of our termbase with respect to the training data, we measured its coverage. Since the terms in the termbase are in their base form, in order to obtain a more accurate estimate we lemmatised [6] the training sets before calculating the overlap between the two resources. As can be seen in Table 2, 24.08% of the termbase entries also appear in the source side of the two training corpora, and 29.19% in the target side, meaning that the two resources complement each other well.

Table 2: Number of lemmas in the generic and in-domain training sets, termbase entries, and coverage of the termbase with respect to the training data.

                     It          En
Europarl lemmas      7,848,936   8,046,827
In-domain lemmas     441,030     399,395
Termbase entries     4,142       4,142
Europarl overlap     23.03%      29.20%
In-domain overlap    27.52%      29.33%
Total overlap        24.08%      29.19%

[4] http://iate.europa.eu/
[5] http://eacea.ec.europa.eu/education/eurydice/
[6] Lemmatisation was performed using the TreeTagger: https://goo.gl/JjHMcZ

3.3 Machine Translation System

We tested the performance of a PBSMT system trained on the resources described in sections 3.1 and 3.2. The system used to build the engines for this experiment is the open-source ModernMT (MMT) [7] (Bertoldi et al., 2017). Two engines were built in MMT:

- One engine trained on the subset of Europarl plus our in-domain corpus.
- One engine trained on the subset of Europarl plus our in-domain corpus and the terminology database.

Both engines were tuned on our development set and evaluated on the test set (see sect. 3.1).

[7] http://www.modernmt.eu/

4 Experimental results

To provide information on the overall translation quality of our PBSMT engines, we calculated the BLEU scores (Papineni et al., 2002) obtained on the test set. Table 3 shows the results for both engines, where the engine without terminology is referred to as w/o terms and the one with terminology as w/ terms.

Table 3: BLEU score for the two engines.

Engine      BLEU
w/o terms   25.92
w/ terms    26.00

Furthermore, we evaluated the systems focusing on their performance on terminology translation. For this purpose, we relied on the f-score. More in detail, for both engines we extracted the number of English termbase entries appearing in the system output and in the reference translation. Exploiting these figures, we were able to compute precision, recall and f-score. Results are reported in Table 4.
Table 4: Number of occurrences of termbase entries in the reference and in the output texts, number of terms in the reference also appearing in the outputs, precision, recall and f-score.

                  w/o terms   w/ terms
Terms in ref      1,133       1,133
Terms in output   1,061       1,083
Correct terms     633         630
Precision         0.596       0.581
Recall            0.558       0.555
F-score           0.577       0.568

The figures in Tables 3 and 4 show that adding our termbase to the training data set does not affect the output in a substantial way. While according to the BLEU score the w/ terms engine slightly outperforms the w/o terms engine, the f-score - indicating performance on term translation - is marginally higher for the w/o terms system.

Focusing on the usage of terminology, a number of observations can be made. As regards the distribution of termbase entries in the test set - which contains 3,465 sentence pairs - it is interesting to note that the number of output and reference sentences containing at least one term is fairly low, i.e. 945 (27.30%) for the reference text, 866 (24.99%) for the w/o terms output and 870 (25.10%) for the w/ terms output.

Considering the terms found in the two outputs, we observe that their number differs by only 23 units (ca. 2% of the number of terms in the outputs). Also, the number of overlapping terms is very high, i.e. 882 terms (out of 1,061 for the engine w/o terms and out of 1,083 for the engine w/ terms). As a matter of fact, the six most frequent terms in the systems' outputs are the same - course, oral, ability, lecture, technology and teacher - and cover approximately half of the total amount of extracted terms for both outputs.

We then compared the English termbase entries appearing in the target side of the test set to those appearing in the training set. Each of the 78 terms occurring at least once in the test set (corresponding to 1,133 total occurrences, as reported in Table 4) also occurs in the training set, 60 of them in its in-domain component.

However, even though our training data cover the total amount of terms present in the test data, and despite the high overlap between the terms produced by the two engines, there is still a considerable number of terms that differ. We thus cannot exclude an influence of the termbase on the word choice of the w/ terms system. For this reason, an in-depth analysis of the different terms produced by the two engines was carried out.

5 Manual Evaluation

The analysis of the sentences where the termbase entries used by the two engines differed showed that in some cases the termbase forced the system to use its target term even if a different translation - sometimes also correct - was present in the training corpora. Some examples are reported in Table 5.

For the source words prova orale (Example 1) and esame scritto (Example 2), the engine w/ terms used oral examination and written examination, while the one w/o terms used oral exam and written exam, but only the occurrences with examination are in the termbase. Moreover, Example 2 also includes the termbase word preparazione, which is translated as preparation by the engine w/ terms, while it is not translated at all by the engine w/o terms.

Another interesting example is the translation of the source word docente (Example 3), where the termbase corrected a wrong translation. The Italian term was wrongly translated as lecture by the engine w/o terms, and as teacher - which is the right translation for this text - by the engine w/ terms.

In Example 4, the Italian sentence contained the termbase entry voto finale, which was translated as final vote by the engine w/o terms and as the termbase MWE final mark by the w/ terms engine. Also in this case the termbase corrected a mistake, since vote is not the correct translation of voto in this context.

Table 5: MT output examples showing the influence of the termbase on the word choice of the w/ terms engine. Note that the correct (✓) and wrong (✗) marks refer to human assessment and not to correspondence with the reference.

Example 1
SRC:        La prova orale si svolgerà sugli argomenti del programma del corso.
REF:        The oral verification will be on the topics of the lectures.
W/O TERMS:  The oral exam will take place on the program of the course. ✓
W/ TERMS:   The oral examination will take place on the program of the course. ✓

Example 2
SRC:        La preparazione dello studente sarà valutata in un esame scritto.
REF:        Student preparation shall be evaluated by a 3 hrs written examination.
W/O TERMS:  The student will be evaluated in a written exam. ✗
W/ TERMS:   The preparation of the student will be evaluated in a written examination. ✓

Example 3
SRC:        Ogni docente titolare
REF:        Each lecturer.
W/O TERMS:  Every lecture. ✗
W/ TERMS:   Every teacher. ✓

Example 4
SRC:        In tal caso il voto finale terrà conto anche della prova orale.
REF:        In this case the final score will be based also on the oral part.
W/O TERMS:  In this case the final vote will take account the oral test. ✗
W/ TERMS:   In this case the final mark will be based also the oral test. ✓

The comparison between the two engines' outputs shows that, even though our training data covered the total amount of terms present in the test set, the termbase influenced the MT output of the engine w/ terms by biasing the weights assigned to a specific translation.

Such results have to be judged taking into account the preliminary nature of this study, which aimed at understanding the practical implications of using terminology in PBSMT and therefore exploited a simplistic approach to inject terms. As a matter of fact, we found that some of the termbase entries occurring in the reference - e.g. certification, instructor, text book, educational material - were not used in the output of the system w/ terms, and this is probably due to the limitations of our method. The terms instructor, text book and educational material did not occur in the w/o terms output either, while certification did.

To sum up, what emerges is that using terminology in PBSMT to translate course catalogues - and more specifically course unit descriptions - can influence the MT output. In our case, since the improvements were measured against the output of the w/o terms engine - which might be correct even when using terms different from those included in the termbase - the metric results were not informative enough and a manual analysis of the terms had to be carried out.
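The starting point of such a manual analysis, collecting the sentence pairs in which the two engines' term choices diverge, can be sketched as follows. This is an illustrative reconstruction, not the authors' actual tooling, and the naive whitespace-based entry matcher is an assumption made here for illustration.

```python
def entries_in(entries, sentence):
    """Return the set of termbase entries found in a sentence.

    Naive matching over lowercased, whitespace-split tokens; an
    illustrative assumption, not the procedure used in the paper.
    """
    tokens = sentence.lower().split()
    found = set()
    for entry in entries:
        etoks = entry.lower().split()
        n = len(etoks)
        if any(tokens[i:i + n] == etoks for i in range(len(tokens) - n + 1)):
            found.add(entry)
    return found

def differing_term_sentences(entries, out_without, out_with):
    """Pair up the two engines' outputs sentence by sentence and keep the
    pairs whose termbase entries differ: the candidates for manual review."""
    diffs = []
    for sent_wo, sent_w in zip(out_without, out_with):
        terms_wo = entries_in(entries, sent_wo)
        terms_w = entries_in(entries, sent_w)
        if terms_wo != terms_w:
            diffs.append((sent_wo, sent_w, terms_wo, terms_w))
    return diffs

# Toy example modelled on Example 1 of Table 5 (hypothetical data):
entries = ["oral examination", "written examination", "teacher", "final mark"]
wo = ["the oral exam will take place on the program of the course"]
w = ["the oral examination will take place on the program of the course"]
print(len(differing_term_sentences(entries, wo, w)))  # → 1
```

Each retained pair carries the two term sets alongside the sentences, so a reviewer can see at a glance which entry the termbase did or did not impose.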
6 Conclusion and further work

This paper has described a preliminary analysis aimed at assessing the use of in-domain terminology in PBSMT in the institutional academic domain, and more precisely for the translation of course unit descriptions from Italian into English. Following the results of the present experiment, and given its preliminary nature, we are planning to carry out further work in this field.

In section 4 we have seen that the institutional academic terms contained in our test data also appeared in the training data, thus limiting the impact of terminology on the output. However, course catalogues and course unit descriptions also include terms belonging to specific disciplines (see sect. 1). In our future work we are therefore planning to focus not only on academic terminology but also on disciplinary terminology, testing its impact on the output of an MT engine translating course unit descriptions.

After this first experiment on the widely-used PBSMT architecture, in future work we are also planning to exploit neural machine translation (NMT). In particular, our goal is to develop an NMT engine able to handle terminology correctly in this text domain, in order to investigate its effect on the post-editor's work. To this end, a termbase focused on the institutional academic domain, e.g. the UCL-K.U.Leuven University Terminology Database [8] or the Innsbrucker Termbank 2.0 [9], could be used to select an adequate benchmark for the development and evaluation of an MT engine with a high degree of accuracy in the translation of terms.

[8] https://goo.gl/huoevR
[9] https://goo.gl/W2GH5h

Acknowledgements

The authors would like to thank Silvia Bernardini, Marcello Soffritti and Adriano Ferraresi from the University of Bologna for their advice on terminology and institutional academic communication, and Mauro Cettolo from FBK for his help with ModernMT. The usual disclaimers apply.

References

Mihael Arcan, Marco Turchi, Sara Tonelli, and Paul Buitelaar. 2014a. Enhancing statistical machine translation with bilingual terminology in a CAT environment. In Yaser Al-Onaizan and Michel Simard, editors, Proceedings of AMTA 2014. Vancouver, BC.

Mihael Arcan, Claudio Giuliano, Marco Turchi, and Paul Buitelaar. 2014b. Identification of bilingual terms from monolingual documents for statistical machine translation. In Proceedings of the 4th International Workshop on Computational Terminology. Dublin, Ireland, pages 22–31. http://www.aclweb.org/anthology/W14-4803.

Nicola Bertoldi, Roldano Cattoni, Mauro Cettolo, Amin Farajian, Marcello Federico, Davide Caroselli, Luca Mastrostefano, Andrea Rossi, Marco Trombetti, Ulrich Germann, and David Madl. 2017. MMT: New open source MT for the translation industry. In Proceedings of the 20th Annual Conference of the European Association for Machine Translation. Prague, pages 86–91.

Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In Nicoletta Calzolari et al., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey, pages 674–679.

Jorge Civera and Alfons Juan. 2007. Domain adaptation in statistical machine translation with mixture modelling. In Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, Prague, Czech Republic, pages 177–180.

European Commission, EACEA, and Eurydice. 2015. The European Higher Education Area in 2015: Bologna Process Implementation Report. Luxembourg: Publications Office of the European Union.

Adriano Ferraresi. 2017. Terminology in European university settings. The case of course unit descriptions. In Paola Faini, editor, Terminological Approaches in the European Context. Cambridge Scholars Publishing, Newcastle upon Tyne, pages 20–40.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit. AAMT, Phuket, Thailand, pages 79–86.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, Prague, pages 224–227.

Philippe Langlais. 2002. Improving a general-purpose statistical translation engine by terminological lexicons. In COLING-02 on COMPUTERM 2002: Second International Workshop on Computational Terminology. Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1–7. https://doi.org/10.3115/1118771.1118776.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, pages 311–318. https://doi.org/10.3115/1073083.1073135.

Sanja Štajner, Andreia Querido, Nuno Rendeiro, João António Rodrigues, and António Branco. 2016. Use of domain-specific language resources in machine translation. In Nicoletta Calzolari et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France, pages 592–598.