Multilingual Neural Machine Translation for Low Resource Languages

Surafel M. Lakew (University of Trento, Italy; Fondazione Bruno Kessler)
Mattia A. Di Gangi (University of Trento, Italy; Fondazione Bruno Kessler)
Marcello Federico (Fondazione Bruno Kessler)
via Sommarive, 18, Trento, Italy
{lakew,digangi,federico}@fbk.eu

Abstract

Neural Machine Translation (NMT) has been shown to be more effective in translation tasks than Phrase-Based Statistical Machine Translation (PBMT). However, NMT systems are limited in translating low-resource languages (LRL), because neural methods require a large amount of parallel data to learn effective mappings between languages. In this work we show how so-called multilingual NMT can help to tackle the challenges associated with LRL translation. Multilingual NMT forces word and subword representations into a shared semantic space across multiple languages. This allows the model to exploit positive parameter transfer between different languages, without changing the standard attention-based encoder-decoder architecture or training modality. We run preliminary experiments with three languages (English, Italian, Romanian), covering six translation directions, and show that for all available directions the multilingual approach, i.e. a single system covering all directions, is comparable to or even outperforms the single bilingual systems. Finally, our approach achieves competitive results also for language pairs not seen at training time, using a pivoting (x-step) translation.

1 Introduction

Neural machine translation (NMT) has recently shown its effectiveness by delivering the best performance in various evaluation campaigns (IWSLT 2016 (Cettolo et al., 2016), WMT 2016 (Bojar et al., 2016)). Unlike rule-based or phrase-based MT, the end-to-end learning approach of NMT models the mapping from source to target language directly through a posterior probability. The basic components of an NMT system include an encoder, a decoder, and an attention mechanism (Bahdanau et al., 2014). Despite the continuous improvement in performance and translation quality, NMT models are highly dependent on the availability of large parallel corpora, which in practice can only be acquired for a very limited number of language pairs. For this reason, building effective NMT systems for low-resource languages becomes a primary challenge (Koehn and Knowles, 2017). Recently, Zoph et al. (2016) showed how a standard string-to-tree statistical MT system (Galley et al., 2006) can effectively outperform NMT methods for low-resource languages such as Hausa, Uzbek, and Urdu.

In this work, we focus on so-called multilingual NMT (Johnson et al., 2016; Ha et al., 2016), which considers the use of NMT to target many-to-many language translation. Our motivation is that intensive cross-lingual transfer (Terence, 1989) via parameter sharing should ideally help in the case of similar languages and sparse training data. Hence, in this work we investigate multilingual NMT across Italian, Romanian, and English, and simulate low-resource conditions by limiting the amount of parallel data.
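In this setting, the systems of Johnson et al. (2016) and Ha et al. (2016) obtain many-to-many translation with a single model by prepending an artificial token to every source sentence that names the desired target language, so that all directions share one encoder, decoder, and vocabulary. Below is a minimal sketch of that preprocessing step; the `<2xx>` flag format and the toy sentences are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of the target-forcing trick used in multilingual NMT
# (Johnson et al., 2016; Ha et al., 2016). The "<2xx>" token format and
# the example sentences are illustrative assumptions.

def add_language_flag(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a token telling the shared model which language to produce."""
    return f"<2{tgt_lang}> {src_sentence}"

# A single multilingual training corpus is the concatenation of all
# direction-specific corpora after flagging the source side:
training_pairs = [
    (add_language_flag("Good morning", "it"), "Buongiorno"),      # En -> It
    (add_language_flag("Buongiorno", "ro"), "Bună dimineața"),    # It -> Ro
    (add_language_flag("Bună dimineața", "en"), "Good morning"),  # Ro -> En
]

for src, tgt in training_pairs:
    print(src, "=>", tgt)
```

The same mechanism also supports the pivoting mentioned in the abstract: output in a pivot language can be re-flagged with a new target token and fed back to the same model.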
Our approach showed a BLEU increase in various language directions in a low-resource setting. To compare a single language pair NMT

2 Neural Machine Translation

The encoder computes a contextual representation of the source words using an RNN implemented with LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014) units:

$$h_t = g(x_t, h_{t-1}), \quad t = 1, \ldots, m$$

where $x_t$ is the embedding of the word at time step $t$ and $m$ is the length of the source sentence. The decoder receives as input the embedding of the target word from the previous decoding time step and computes, through an RNN, a new representation of the current translation, given the representation at the previous step and a relevant source context computed by the attention model. At each time step, the attention model computes normalized weights over the source word positions according to the hidden state of the decoder; these weights are then used to compute the source context as a weighted sum of all the encoder hidden states. There are several strategies to implement a decoder, but all of them end up computing the conditional probability of the next target word given the previously translated words and the source sentence:

$$p(e_i = k \mid e_{<i}, f)$$
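To make these equations concrete, here is a minimal NumPy sketch of one encoder pass and one attention-weighted decoding step. It assumes a vanilla tanh RNN cell, a dot-product attention scorer, and an output layer $\mathrm{softmax}(W_o [s_i; c_i])$; a faithful implementation of the cited works would use LSTM/GRU cells and the MLP scorer of Bahdanau et al. (2014), and all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative, not from the paper)
V, d, m = 10, 4, 5        # target vocabulary, hidden size, source length

# Encoder parameters for a vanilla tanh RNN; the paper's description
# uses LSTM or GRU cells, which a full implementation would substitute.
W_x = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))

def encode(X):
    """h_t = g(x_t, h_{t-1}) over the source embeddings X (m x d)."""
    h, states = np.zeros(d), []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)   # one hidden state per source word

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(H, s):
    """Normalized weights over source positions for decoder state s,
    and the context vector as the weighted sum of encoder states."""
    alpha = softmax(H @ s)    # dot-product scores for brevity;
                              # Bahdanau et al. (2014) use an MLP scorer
    return alpha, alpha @ H

# One decoding step: an illustrative output layer computes
# p(e_i = k | e_{<i}, f) = softmax(W_o [s_i; c_i])
W_o = rng.normal(size=(V, 2 * d))

X = rng.normal(size=(m, d))   # source word embeddings (stand-ins)
H = encode(X)
s_i = rng.normal(size=d)      # current decoder hidden state (stand-in)
alpha, c_i = attention(H, s_i)
p_next = softmax(W_o @ np.concatenate([s_i, c_i]))

print("attention weights:", np.round(alpha, 3))
print("next-word probabilities sum to", round(p_next.sum(), 6))
```

The point the sketch illustrates is that the next-word distribution is conditioned jointly on the decoder's own state and on the attention context over the source, matching $p(e_i = k \mid e_{<i}, f)$ above.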