On the Development of Customized Neural Machine Translation Models


                             Mauro Cettolo, Roldano Cattoni, Marco Turchi
                                Fondazione Bruno Kessler, Trento, Italy
                              {cettolo,cattoni,turchi}@fbk.eu


                        Abstract                                dress the problem of the quantitative and qualita-
                                                                tive inadequacy of parallel data necessary to de-
    Recent advances in neural modeling                          velop translation models. Among others, deeply
    boosted performance of many machine                         investigated methods are: corpus filtering (Koehn
    learning applications. Training neural net-                 et al., 2020), data augmentation such as data
    works requires large amounts of clean                       selection (Moore and Lewis, 2010; Axelrod et
    data, which are rarely available; many                      al., 2011) and back-translation (Bertoldi and Fed-
    methods have been designed and inves-                       erico, 2009; Sennrich et al., 2016), model adapta-
    tigated by researchers to tackle this is-                   tion (Luong and Manning, 2015; Chu and Wang,
    sue. As a partner of a project, we were                     2018). They should be the working tools of any-
    asked to build translation engines for the                  one who has to develop neural MT models for spe-
    weather forecast domain, relying on few,                    cific language pairs and domains.
    noisy data. Step by step, we developed                          This paper reports on the development of neural
    neural translation models, which outper-                    MT models for translating forecast bulletins from
    form by far Google Translate. This pa-                      German into English and Italian, and from Ital-
    per details our approach, that - we think                   ian into English and German. We were provided
    - is paradigmatic for a broader category of                 with in-domain parallel corpora for each language
    applications of machine learning, and as                    pair but not in sufficient quantity to train a neural
    such could be of widespread utility.                        model from scratch. Moreover, from the prelim-
                                                                inary analysis of data, the English side resulted
1    Introduction                                               noisy (e.g. missing or partial translations, mis-
The field of machine translation (MT) has experi-               aligned sentences, etc.), affecting the quality of
enced significant advances in recent years thanks               any pair involving that language. For this very rea-
to improvements in neural modeling. On the one                  son, we focus on one of the pairs involving English
hand, this represents a great opportunity for indus-            we had to cover, namely Italian-English.
trial MT, on the other it also poses the great chal-                An overview of the in-domain data and the de-
lenge of collecting large amounts of clean data,                scription of their analysis are given in Section 2,
needed to train neural networks. MT training data               highlighting the issues that emerged. Section 3 de-
are parallel corpora, that is collections of sentence           scribes the previously listed methods together with
pairs where a sentence in the source language is                their employment in our specific use-case. De-
paired with the corresponding translation in the                veloped neural translation models are itemized in
target language. Parallel corpora are typically                 Section 4, where their performance are compared
gathered from any available source, in most cases               and discussed; our best models outperform by far
the web, without much guarantees about quality                  Google Translate and some examples will give a
nor domain homogeneity.                                         grasp of the actual translation quality.
   Over the years, the scientific community has                     We think that our approach to the specific prob-
accumulated a lot of knowledge on ways to ad-                   lem we had to face is paradigmatic for a broader
                                                                category of machine learning applications, and we
     Copyright © 2021 for this paper by its authors. Use per-
mitted under Creative Commons License Attribution 4.0 In-       hope that it will be useful to the whole NLP scien-
ternational (CC BY 4.0).                                        tific community.
2       Data                                              2.2    Analysis and Issues
We were provided with two csv files of weather            As a good practice before starting the creation of
forecast bulletins, issued by two different forecast      MT models, data have been inspected and ana-
services that from here on are identified with the        lyzed looking for potential problems. Several crit-
acronyms BB and TT. Each row of the BB csv con-           ical issues emerged, which are described in the fol-
tains, among other things, the text of the original       lowing paragraphs.
bulletin written in German and, possibly, its trans-      Non-homogeneity of data - Since data originated
lation into Italian and/or English; in the TT csv         from two distinct weather forecast services (BB
rows, the Italian bulletin is paired with its transla-    and TT), first of all it must be established whether
tion into German and/or English.                          they are linguistically similar and, if so, to what
2.1      Statistics                                       extent. For this purpose, focusing on the lan-
                                                          guages of the it-en benchmarks, we measured the
BB Bulletins were extracted from the BB csv file          perplexity of the BB and TT test sets on n-gram
and paired for any possible combination of lan-           language models (LMs) estimated on the BB and
guages. Each bulletin is stored on a single line          TT training sets:2 the closer the perplexity values
but split in a few dozen fields; the average length       of a given text on the two LMs, the greater the lin-
of each field (about 18 German words) is appro-           guistic similarity of BB and TT training sets. Ta-
priate for MT systems, which process long sen-            ble 3 reports values of perplexity (PP) and out-of-
tences with difficulty. Table 1 shows statistics of       vocabulary rates (%OOV) for all test sets vs. LMs
the training and test sets for the it-en language pair.   combinations.3
 site task set      #seg    #src w   #trg w                                  LM trained on
           trn-nsy 30,957 626,211 505,688                                 BB trn        TT trn
 BB it-en tst-nsy 20,000 376,553 298,560                                PP %OOV PP %OOV
           tot     50,957 1,002,764 804,248                     BB tst 10.8    0.22 92.0    12.07
                                                             it
                                                                TT tst 42.4    0.60 10.3     0.41
Table 1: Statistics of the BB it-en benchmark. The
                                                                BB tst 8.9     0.14 80.1     8.49
label nsy will be clear after reading Section 3.2.           en
                                                                TT tst 65.6    2.05 12.7     0.51

TT Bulletins were extracted from the TT csv file           Table 3: Cross comparison of BB and TT texts.
and paired for each language combination. Dif-
ferently than the BB case, each TT bulletin was              Overall, we can notice that the PP of the two test
stored on a single line without any field split;          sets significantly varies when computed on in- and
since bulletins are quite long for automatic pro-         out-of-domain data. The PP of any given test set is
cessing (on average 30 Italian words) and are             4 (42.4 vs. 10.8) to 9 (92.0 vs. 10.3) times higher
the concatenation of rather heterogeneous sen-            when measured on the LM estimated on the text
tences, we decided to segment them by splitting on        of the other provider than on the text of the same
strong punctuation. This requires a re-alignment          provider. These results highlight the remarkable
of source/target segments because in general they         linguistic difference between the bulletins issued
differ in number. The re-alignment was performed          by the two forecast services.
by means of the hunalign sentence aligner1 (Varga         In-domain data scarcity - Current state-of-the-
et al., 2005). Table 2 shows statistics of the train-     art MT neural networks (Section 4.1) have dozens
ing and test sets for the it-en language pair.            to hundreds million parameters that have to be es-
                                                          timated from data. Unfortunately, the amount of
    site task     set #seg #src w    #trg w
                                                          provided data does not allow an effective estima-
                  trn 5,177 78,834 73,763
                                                          tion from scratch of such a huge number of param-
    TT      it-en tst 1,962 30,232 28,135
                                                          eters, as we will empirically prove in Section 4.3.
                  tot 7,139 109,066 101,898
                                                             2
                                                                3-gram LMs with modified shift beta smoothing were es-
    Table 2: Statistics of the TT it-en benchmark.        timated using the IRSTLM toolkit (Federico et al., 2008).
                                                              3
                                                                In order to isolate the genuine PP of the text, the dictio-
                                                          nary upperbound to compute OOV word penalty was set to 0;
    1
        github.com/danielvarga/hunalign                   the OOV rates are shown for this very reason.
BB English side - BB data have a major problem           participants, their methods and results. For ref-
on the English side. In fact, looking at csv file,       erence purposes, organizers set up a competitive
we realized that many German bulletins were not          baseline based on LASER (Language-Agnostic
translated at all into English. Moreover, in the En-     SEntence Representations)4 (Schwenk and Douze,
glish side there are 20% fewer words than in the         2017) multilingual sentence embeddings. The un-
corresponding German or Italian sides, a differ-         derlying idea is to use the cosine distance between
ence that is not justified by the morpho-syntactic       the embeddings of the source and the target sen-
variations between languages. In fact, it happens        tences to measure their parallelism. In a similar
that entire portions of the original German bul-         way we cleaned the BB noisy benchmark, filtering
letins are not translated into English, or that a def-   with a threshold of 0.9; statistics of the resulting
initely more compact form is used, as in:                bi-text are given in Table 4.
de: Der Hochdruckeinfluss hält bis auf weiteres an.
en: High pressure conditions.                              site task     set     #seg #src w #trg w
                                                                         trn-cln 1,673 37,629 40,256
   This critical issue affects both training and test
                                                           BB      it-en tst-cln 1,011 20,280 21,657
sets, as highlighted by figures in Table 1; as such,
                                                                         tot     2,684 57,909 61,913
it negatively impacts both the quality of the trans-
lation models, if trained/adapted on such noisy          Table 4: Stats of the filtered BB it-en benchmark.
data, and the reliability of evaluations, if run on
such distorted data. A careful corpus filtering is          The filtered bi-text does not suffer anymore
therefore needed, as discussed in Section 3.2.           from the imbalance number of words but it is 20
                                                         times smaller than the original one.
3     Methods
                                                         3.3     Data Augmentation
3.1    MT Model Adaptation
                                                         Since the corpus filtering discussed in the previous
A standard method for facing the in-domain data
                                                         section removes most of the original data, further
scarcity issue mentioned in Section 2.2 is the
                                                         exacerbating the problem of data scarcity, we tried
so-called fine-tuning: given a neural MT model
                                                         to overcome this unwanted side effect by means of
trained on a large amount of data in one domain,
                                                         data augmentation methods.
its parameters are tuned by continuing the train-
ing using a small amount of data from another do-        3.3.1     Data Selection
main (Luong and Manning, 2015; Chu and Wang,
                                                         A widespreadly adopted data augmentation
2018). Though effective on the new in-domain
                                                         method is data selection. Data selection assumes
data supplied for model adaptation, fine-tuning
                                                         the availability of a large general domain corpus
typically suffers from performance drops on un-
                                                         and a small in-domain corpus; in MT, the aim is to
seen data (test set), unless proper regularization
                                                         extract parallel sentences from the large bilingual
techniques are adopted (Miceli Barone et al.,
                                                         corpus that are most relevant to the target domain
2017). We avoid overfitting by fine-tuning our MT
                                                         as defined by the small corpus.
models with dropout (set to 0.3) (Srivastava et al.,
                                                            On the basis of the bilingual cross-entropy dif-
2014) and performing only a limited number of
                                                         ference (Axelrod et al., 2011), we sorted the sen-
epochs (5) (Miceli Barone et al., 2017).
                                                         tence pairs of the OPUS collection,5 used as gen-
3.2    Corpus Filtering                                  eral domain large dataset, according to their rel-
Machine learning typically requires large sets of        evance to the domain determined by the concate-
clean data. Since rarely large data sets are also        nation of the BB and TT training sets. To estab-
clean, researchers devoted much effort to data           lish the optimal size of the selection, we trained
cleaning, the automatic process to identify and re-      LMs - created in the same setup described in non-
move errors from data. The MT community is no            homogeneity of data paragraph of Section 2.2 - on
exception. Even, WMT - the conference on ma-             increasing amounts of selected data and computed
chine translation - in 2018, 2019 and 2020 edi-          the PP of BB and TT test sets, separately for each
tions organized a Shared Task on Parallel Corpus         side. Figure 1 plots the curves; the straight lines on
Filtering. Koehn et al. (2020) provide details on           4
                                                                github.com/facebookresearch/LASER
                                                            5
the task proposed in the more recent edition, on                opus.nlpl.eu
the bottom correspond to the PP of the same test                          #segments #src w #trg w
sets on LMs built on the in-domain training sets.                 it-en     32.0M   339M 352M
                                                        Table 6: Stats of the parallel generic training sets.


                                                        lation into Italian of the 31k English segments of
                                                        the training set (Table 1) was performed by an
                                                        in-house generic en-it MT engine (details in Ap-
                                                        pendix A.1 of (Bentivogli et al., 2021)). Row
                                                        BT of Table 5 shows the statistics of this artifi-
                                                        cial bilingual corpus; similarly to what happened
                                                        with the filtering process, the numbers of Italian
                                                        and English words are much more compatible than
                                                        they are in the original version of the corpus.
Figure 1: Perplexity of test sets on LMs estimated
on increasing amounts of selected data.                 4       Experimental Results
   The form of curves is convex, as usual in data       4.1      MT Engine
selection. In our case, the best trade-off between
the pertinence of data and its amount occur when        The MT engine is built on the ModernMT
something more than a million words is selected;        framework6 which implements the Trans-
therefore, we decided to mine from OPUS the             former (Vaswani et al., 2017) architecture. The
bilingual text whose size is given in row DS of         original generic model is Big sized, as defined
Table 5. Anyway, note that the lowest PP for se-        in (Vaswani et al., 2017) by more than 200
lections is at least one order of magnitude greater     million parameters. For training, bi-texts were
than on LMs trained on in-domain training sets.         downloaded from the OPUS repository5 and
                                                        then filtered through the already mentioned data
  task  set #seg     #src w    #trg w                   selection method (Axelrod et al., 2011) using a
        DS 206,990 1,352,623 1,312,068                  general-domain seed. Statistics of the resulting
  it-en                                                 corpus are provided in Table 6. Training was
        BT 30,957    482,398   505,688
                                                        performed in the setup detailed in (Bentivogli et
Table 5: Stats of selected and back translated data.    al., 2021).
                                                           The same Big model and its smaller variants,
3.3.2    Back Translation                               the Base with 50 million parameters and the Tiny
Another well-known data-augmentation method,            with 20 million parameters, were also trained on
which somehow also represents an alternative            in-domain data only for the sake of comparison.
way to corpus filtering for dealing with the BB
English side issue, is back-translation. Back-          4.2      MT Models
translation (Bertoldi and Federico, 2009; Sennrich      We empirically compared the quality of trans-
et al., 2016; Edunov et al., 2018) assumes the          lations generated by various MT models: two
availability of an MT system from the target lan-       generic, three genuine in-domain of different size
guage to the source language and of target mono-        and several variants of our generic model adapted
lingual data. The MT system is used to translate        (Section 3.1) on in-domain data resulting from the
the target monolingual data into the source lan-        presented methods: filtering (Section 3.2), data se-
guage. The result is a parallel corpus where the        lection (Section 3.3.1) and back-translation (Sec-
source side is the synthetic MT output while the        tion 3.3.2). Performance was measured on the
target is human text. The synthetic parallel cor-       BB and TT test sets in terms of BLEU (Pap-
pus is then used to train or adapt a source-to-target   ineni et al., 2002), TER (Snover et al., 2006) and
MT system. Although simple, this method has             CHRF (Popović, 2015) scores computed by means
been shown to be very effective. We used back-          of SacreBLEU (v1.4.14) (Post, 2018), with default
translation to generate a synthetic, but hopefully
                                                            6
cleaner, version of the BB training set. The trans-             github.com/modernmt/modernmt
                                                       BB                                   TT
          MT model                  noisy test set           clean test set               test set
                               %BLEU↑ %TER↓ CHRF↑ %BLEU↑ %TER↓ CHRF↑ %BLEU↑ %TER↓ CHRF↑
Generic models:
 GT⋆                   11.45 106.61 .3502 32.59 51.72 .6104 32.20                          61.56 .6315
 FBK (Transformer big) 07.43 113.07 .3833 19.68 63.68 .5229 23.45                          70.46 .5525
Pure in-domain models trained on BBtrn-nsy+TTtrn:
 Transformer tiny      23.34 83.86 .4882 35.80 61.05 .5808 42.19                           51.79 .6488
 Transformer base      18.39 93.41 .4590 22.06 85.91 .5237 29.17                           64.73 .5351
 Transformer big       20.45 95.76 .4755 24.73 89.26 .5330 28.01                           68.42 .5193
FBK model adapted on:
 BBtrn-nsy             21.211 80.822 .47852 37.913 46.913 .6172 13.77                      79.14 .4007
 BBtrn-cln             10.67 108.86 .4195 31.57 52.54 .5950 27.68                          65.05 .5912
 TTtrn                 10.44 107.48 .4241 28.64 54.20 .5800 39.61                          52.64 .6702
 DS                    10.82 109.71 .4255 30.11 54.86 .5873 29.76                          63.68 .6099
 BT                    12.50 106.85 .4507 34.85 49.78 .6339 32.71                          58.95 .6372
 BBtrn-nsy+TTtrn       19.303 79.291 .4449 32.81 52.38 .5680 40.513                        51.973 .6579
 BBtrn-nsy+TTtrn+DS+BT 19.362 86.333 .47921 41.171 44.671 .64882 40.692                    51.842 .67343
 BBtrn-cln+TTtrn       12.39 105.36 .4450 37.02 47.40 .63653 40.34                         52.16 .67552
 BBtrn-cln+TTtrn+DS+BT 13.75 104.59 .46193 40.092 45.282 .66171 41.161                     51.011 .68031

Table 7: BLEU/TER/CHRF scores of MT models on it-en test sets. 1 , 2 and 3 indicate the “podium
position” among the adapted models of each column. (⋆ ) Google Translate, as it was on 14 Sep 2021.


signatures.7                                           (from 37.02 to 40.09) when DS and BT data are
                                                       added to the adaptation corpus.
4.3    Results and Comments                               The fine-tuning of a Transformer big generic
Scores are collected in Table 7. First, as ex-         model to the weather forecast domain turned out
pected (in-domain data scarcity paragraph of Sec-      to be more effective than any training from scratch
tion 2.2), it is not feasible to properly train a      using original in-domain data only: the top per-
huge number of parameters with few data; in            forming model - BBtrn-cln+TTtrn+DS+BT - def-
fact, the best performing pure in-domain model is      initely improves the Transformer tiny with re-
the smallest one (Transformer tiny). Instead, the      spect to all metrics on the BB clean test set
naive application of the MT state-of-the-art would     (40.09/45.28/.6617 vs 35.80/61.05/.5808), and to
have led to simply train a Transformer big model       two metrics out of three on the TT test set (TER:
on the original in-domain data. This model would       51.01 vs. 51.79, CHRF: .6803 vs. .6488). More-
not have been competitive with GT on TT data           over, all its scores are a lot better than those of
(28.01 vs. 32.20 BLEU); it would have been on          Google Translate.
BB data if we had only considered the noisy test
set (20.45 vs. 11.45) resulting in an important mis-   4.4   Examples
interpretation of the actual quality of the two sys-   To give a grasp of the actual quality of automatic
tems; conversely, our preliminary analysis allowed     translations, Table 8 collects the English text gen-
us to discover the need of cleaning BB data, which     erated by some of the tested MT models fed with a
guarantees a reliable assessment (24.73 vs. 32.59).    rather complex Italian source sentence. The man-
   Data augmentation methods (DS, BT) are both         ual translations observed in BB data are shown as
effective in making available additional useful bi-    well: their number, their variety, some question-
texts; for example, the BLEU score of the model        able/wrong lexical choices in them (“high” instead
BBtrn-cln+TTtrn increases by 3 absolute points         of “upper-level currents”, “South-western” instead
   7
                                                       of “Southwesterly”) and one totally wrong (“Weak
    BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a,
TER+tok.tercom-nonorm-punct-noasian-uncased,           high pressure conditions.”) prove the difficulty of
chrF2+numchars.6+space.false                           learning from such data and the need to pay par-
    Italian source sentence:
           Le correnti in quota si disporranno da sudovest avvicinando masse d’aria più umida alle Alpi.
    Manual English translations found in BB bulletins:
           Weak high pressure conditions.
           The high currents will turn to south-west and humid air mass will reach the Alps.
           Southwesterly currents will bring humid air masses to South Tyrol.
           South-western currents will bring humid air masses to the Alps.
           South-westerly upper level flow will bring humid air masses towards our region.
           More humid air masses will reach the Alps.
           Humid air reaches the Alps with South-westerly winds.
    Automatic English translations generated by some MT models:
    GT     The currents at high altitudes will arrange themselves from the southwest, bringing more
           humid air masses closer to the Alps.
    FBK Currents in altitude will be deployed from the southwest, bringing wet air masses closer to
           the Alps.
    Transformer tiny               South-westerly upper level flow will bring humid air masses towards
                                   the Alps.
    BBtrn-cln+TTtrn+DS+BT The upper level flow will be arranged from the southwest approaching
                                   more humid air masses to the Alps.

                          Table 8: Examples of manual and automatic translations.


ticular attention to the evaluation phase. Concern-      noisy and heterogeneous. We faced these issues
ing translations, GT is able to keep most of the         by exploiting a number of methods which repre-
meaning of the source text but the translation is        sent established knowledge of the scientific com-
too literal to result in fluent English. FBK only        munity: adaptation of neural models, corpus fil-
partially transfers the meaning from the source          tering and data augmentation techniques such as
and generates a rather bad English text. Trans-          data selection and back-translation. In particular,
former tiny provides a very good translation both        corpus filtering allowed us to avoid the misleading
from a semantic and a syntactic point of view, los-      results observed on the original noisy data, while
ing only the negligible detail that the “air masses”     adaptation and data augmentation proved useful in
are “more humid”, not simply “humid”. Finally,           effectively taking advantage of out-of-domain re-
BBtrn-cln+TTtrn+DS+BT, the model that on the             sources.
basis of our evaluations is the best one, on this spe-
cific example works very well at the semantic level
but rather poorly on the grammatical level.              References
   This example shows that pure in-domain mod-           Amittai Axelrod, Xiaodong He, and Jianfeng Gao.
els, as expected, are “more in-domain” than               2011. Domain Adaptation via Pseudo In-Domain
generic models, though adapted, showing greater           Data Selection. In Proc. of EMNLP, pages 355–362,
                                                          Edinburgh, Scotland, UK.
adherence to domain-specific language. On the
other hand, according to scores in Table 7, adapted      Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina
models should be better in generalization. Only            Karakanta, Alberto Martinelli, Matteo Negri, and
subjective evaluations involving meteorologists            Marco Turchi. 2021. Cascade versus Direct Speech
                                                           Translation: Do the Differences Still Make a Differ-
can settle the question of which model is the best.
                                                           ence? In Proc. of ACL/IJCNLP (Volume 1: Long
                                                           Papers), pages 2873–2887, Bangkok, Thailand.
5     Conclusions
                                                         Nicola Bertoldi and Marcello Federico. 2009. Domain
In this paper we described the development pro-            Adaptation for Statistical Machine Translation with
cess that led us to build competitive customized           Monolingual Resources. In Proc. of WMT, pages
                                                           182–189, Athens, Greece.
translation models. Given the provided in-domain
data, we started by analyzing them under sev-            Chenhui Chu and Rui Wang. 2018. A Survey of Do-
eral perspectives and discovered that they are few,        main Adaptation for Neural Machine Translation. In
  Proc. of COLING, pages 1304–1319, Santa Fe, US-       Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
  NM.                                                     Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
                                                          Dropout: A Simple Way to Prevent Neural Networks
Sergey Edunov, Myle Ott, Michael Auli, and David          from Overfitting. Journal of Machine Learning Re-
  Grangier. 2018. Understanding Back-Translation          search, 15(56):1929–1958.
  at Scale. In Proc. of EMNLP, pages 489–500, Brus-
  sels, Belgium.                                        Dániel Varga, Péter Halácsy, András Kornai, Nagy Vik-
                                                           tor, Nagy Laszlo, N. László, and Tron Viktor. 2005.
Marcello Federico, Nicola Bertoldi, and Mauro Cet-         Parallel Corpora for Medium Density Languages.
 tolo. 2008. IRSTLM: An Open Source Toolkit for            In Proc. of RANLP, pages 590–596, Borovets, Bul-
 Handling Large Scale Language Models. In Proc. of         garia.
 Interspeech, pages 1618–1621, Brisbane, Australia.
                                                        Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
                                                          Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky,        Kaiser, and Illia Polosukhin. 2017. Attention is
  Naman Goyal, Peng-Jen Chen, and Francisco               All You Need. In Proc. of NIPS, pages 5998–6008,
  Guzmán. 2020. Findings of the WMT 2020 Shared          Long Beach, US-CA.
  Task on Parallel Corpus Filtering and Alignment. In
  Proc. of WMT, pages 726–742, Online.

Minh-Thang Luong and Christopher D. Manning.
  2015. Stanford Neural Machine Translation Sys-
  tems for Spoken Language Domains. In Proc. of
  IWSLT, pages 76–79, Da Nang, Vietnam.

Antonio Valerio Miceli Barone, Barry Haddow, Ulrich
  Germann, and Rico Sennrich. 2017. Regulariza-
  tion Techniques for Fine-tuning in Neural Machine
  Translation. In Proc. of EMNLP, pages 1489–1494,
  Copenhagen, Denmark.

Robert C. Moore and William Lewis. 2010. Intelli-
  gent Selection of Language Model Training Data. In
  Proc. of ACL (Short Papers), pages 220–224, Upp-
  sala, Sweden.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
  Jing Zhu. 2002. BLEU: a Method for Automatic
  Evaluation of Machine Translation. In Proc. of ACL,
  pages 311–318, Philadelphia, US-PA.

Maja Popović. 2015. chrF: character n-gram F-score
 for automatic MT evaluation. In Proc. of WMT,
 pages 392–395, Lisbon, Portugal.

Matt Post. 2018. A Call for Clarity in Reporting BLEU
 Scores. In Proc. of WMT, pages 186–191, Belgium,
 Brussels.

Holger Schwenk and Matthijs Douze. 2017. Learning
  Joint Multilingual Sentence Representations with
  Neural Machine Translation. In Proc. of RepL4NLP,
  pages 157–167, Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra Birch.
  2016. Improving Neural Machine Translation Mod-
  els with Monolingual Data. In Proc. of ACL (Volume
  1: Long Papers), pages 86–96, Berlin, Germany.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea
 Micciulla, and John Makhoul. 2006. A Study of
 Translation Edit Rate with Targeted Human Annota-
 tion. In Proc. of AMTA, pages 223–231, Cambridge,
 US-MA.