Multilingual Neural Machine Translation for Low Resource Languages

Surafel M. Lakew (University of Trento, Italy; Fondazione Bruno Kessler)
Mattia A. Di Gangi (University of Trento, Italy; Fondazione Bruno Kessler)
Marcello Federico (Fondazione Bruno Kessler)
via Sommarive, 18, Trento, Italy
{lakew,digangi,federico}@fbk.eu

Abstract

Neural Machine Translation (NMT) has been shown to be more effective in translation tasks than Phrase-Based Statistical Machine Translation (PBMT). However, NMT systems are limited in translating low-resource languages (LRL), because neural methods require a large amount of parallel data to learn effective mappings between languages. In this work we show how so-called multilingual NMT can help to tackle the challenges associated with LRL translation. Multilingual NMT forces word and subword representations into a shared semantic space across multiple languages. This allows the model to exploit positive parameter transfer between different languages, without changing the standard attention-based encoder-decoder architecture or training modality. We run preliminary experiments with three languages (English, Italian, Romanian), covering six translation directions, and show that for all available directions the multilingual approach, i.e. a single system covering all directions, is comparable to or even outperforms the single bilingual systems. Finally, our approach achieves competitive results also for language pairs not seen at training time, using a pivoting (x-step) translation.

1 Introduction

Neural machine translation (NMT) has recently shown its effectiveness by delivering the best performance in various evaluation campaigns (IWSLT 2016 (Cettolo et al., 2016), WMT 2016 (Bojar et al., 2016)). Unlike rule-based or phrase-based MT, the end-to-end learning approach of NMT models the mapping from source to target language directly through a posterior probability. The basic components of an NMT system include an encoder, a decoder, and an attention mechanism (Bahdanau et al., 2014). Despite the continuous improvement in performance and translation quality, NMT models are highly dependent on the availability of large parallel corpora, which in practice can only be acquired for a very limited number of language pairs. For this reason, building effective NMT systems for low-resource languages becomes a primary challenge (Koehn and Knowles, 2017). Recently, Zoph et al. (2016) showed how a standard string-to-tree statistical MT system (Galley et al., 2006) can effectively outperform NMT methods for low-resource languages such as Hausa, Uzbek, and Urdu.

In this work, we focus on so-called multilingual NMT (Johnson et al., 2016; Ha et al., 2016), which considers the use of NMT to target many-to-many language translation. Our motivation is that intensive cross-lingual transfer (Terence, 1989) via parameter sharing should ideally help in the case of similar languages and sparse training data. Hence, in this work we investigate multilingual NMT across Italian, Romanian, and English, and simulate low-resource conditions by limiting the amount of parallel data.
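In this setting, the systems of Johnson et al. (2016) and Ha et al. (2016) obtain many-to-many translation with a single model by prepending an artificial token to every source sentence that names the desired target language, so that all directions share one encoder, decoder, and vocabulary. Below is a minimal sketch of that preprocessing step; the `<2xx>` flag format and the toy sentences are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of the target-forcing trick used in multilingual NMT
# (Johnson et al., 2016; Ha et al., 2016). The "<2xx>" token format and
# the example sentences are illustrative assumptions.

def add_language_flag(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a token telling the shared model which language to produce."""
    return f"<2{tgt_lang}> {src_sentence}"

# A single multilingual training corpus is the concatenation of all
# direction-specific corpora after flagging the source side:
training_pairs = [
    (add_language_flag("Good morning", "it"), "Buongiorno"),      # En -> It
    (add_language_flag("Buongiorno", "ro"), "Bună dimineața"),    # It -> Ro
    (add_language_flag("Bună dimineața", "en"), "Good morning"),  # Ro -> En
]

for src, tgt in training_pairs:
    print(src, "=>", tgt)
```

The same mechanism also supports the pivoting mentioned in the abstract: output in a pivot language can be re-flagged with a new target token and fed back to the same model.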
Our approach showed a BLEU increase in various language directions in a low-resource setting. To compare a single language pair NMT

2 Neural Machine Translation

The encoder computes a contextual representation of the source words using an RNN implemented with LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014) units:

$$h_t = g(x_t, h_{t-1}), \quad t = 1, \ldots, m$$

where $x_t$ is the embedding of the word at time step $t$ and $m$ is the length of the source sentence. The decoder receives as input the embedding of the target word from the previous decoding time step and computes, through an RNN, a new representation of the current translation, given the representation at the previous step and a relevant source context computed by the attention model. At each time step, the attention model computes normalized weights over the source word positions according to the hidden state of the decoder; these weights are then used to compute the source context as a weighted sum of all the encoder hidden states. There are several strategies to implement a decoder, but all of them end up computing the conditional probability of the next target word given the previously translated words and the source sentence:

$$p(e_i = k \mid e_{<i}, f)$$
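To make these equations concrete, here is a minimal NumPy sketch of one encoder pass and one attention-weighted decoding step. It assumes a vanilla tanh RNN cell, a dot-product attention scorer, and an output layer $\mathrm{softmax}(W_o [s_i; c_i])$; a faithful implementation of the cited works would use LSTM/GRU cells and the MLP scorer of Bahdanau et al. (2014), and all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative, not from the paper)
V, d, m = 10, 4, 5        # target vocabulary, hidden size, source length

# Encoder parameters for a vanilla tanh RNN; the paper's description
# uses LSTM or GRU cells, which a full implementation would substitute.
W_x = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))

def encode(X):
    """h_t = g(x_t, h_{t-1}) over the source embeddings X (m x d)."""
    h, states = np.zeros(d), []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)   # one hidden state per source word

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(H, s):
    """Normalized weights over source positions for decoder state s,
    and the context vector as the weighted sum of encoder states."""
    alpha = softmax(H @ s)    # dot-product scores for brevity;
                              # Bahdanau et al. (2014) use an MLP scorer
    return alpha, alpha @ H

# One decoding step: an illustrative output layer computes
# p(e_i = k | e_{<i}, f) = softmax(W_o [s_i; c_i])
W_o = rng.normal(size=(V, 2 * d))

X = rng.normal(size=(m, d))   # source word embeddings (stand-ins)
H = encode(X)
s_i = rng.normal(size=d)      # current decoder hidden state (stand-in)
alpha, c_i = attention(H, s_i)
p_next = softmax(W_o @ np.concatenate([s_i, c_i]))

print("attention weights:", np.round(alpha, 3))
print("next-word probabilities sum to", round(p_next.sum(), 6))
```

The point the sketch illustrates is that the next-word distribution is conditioned jointly on the decoder's own state and on the attention context over the source, matching $p(e_i = k \mid e_{<i}, f)$ above.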