Multi-source Transformer for Automatic Post-Editing

Amirhossein Tebbifakhr1,2, Ruchit Agrawal1,2, Matteo Negri1, Marco Turchi1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento - Italy
2 University of Trento, Italy
{atebbifakhr,ragrawal,negri,turchi}@fbk.eu

Abstract

English. Recent approaches to the Automatic Post-editing (APE) of Machine Translation (MT) have shown that the best results are obtained by neural multi-source models that correct the raw MT output by also considering information from the corresponding source sentence. In this paper, we pursue this objective by exploiting, for the first time in APE, the Transformer architecture. Our approach is much simpler than the best current solutions, which are based on ensembling multiple models and adding a final hypothesis re-ranking step. We evaluate our Transformer-based system on the English-German data released for the WMT 2017 APE shared task, achieving results that outperform the state of the art with a simpler architecture suitable for industrial applications.

Italiano. The most effective approaches to the automatic correction of errors in machine translation (Automatic Post-editing, APE) are currently based on multi-source neural models, i.e. models able to exploit information coming both from the sentence to be corrected and from the sentence in the source language. Following this approach, in this paper we apply the Transformer architecture for the first time, obtaining a system that is considerably less complex than those proposed so far (the best of which are based on the combination of multiple models). Through experiments on the English-German data released for the APE task at WMT 2017, we show that, in addition to this gain in simplicity, the proposed method achieves results above the state of the art.

1 Introduction

Automatic post-editing (APE) (Simard et al., 2007b; Simard et al., 2007a; Simard et al., 2009) is the task of fixing errors in a machine-translated text by learning from human corrections. It has been shown to be useful for various tasks like domain adaptation (Isabelle et al., 2007) and for reducing the time, effort and overall costs of human translation in industry environments (Aziz et al., 2012).

Recent approaches to the task have shown that better results can be obtained by neural multi-source models that perform the automatic correction of raw MT output by also considering information from the corresponding source sentence (Chatterjee et al., 2015; Pal et al., 2016). However, state-of-the-art APE solutions employ pipelined architectures (Bojar et al., 2017) whose complexity reduces their usability in industrial settings. Indeed, current top systems typically rely on ensembling multiple recurrent neural networks (RNNs) and performing a final re-ranking step (Chatterjee et al., 2017) to select the most promising correction hypothesis. Though competitive, such architectures require training and maintaining multiple components, involving costs that reduce their appeal from the industry perspective.

In this paper, we address this issue, aiming at a method that is suitable for industry applications, in which a single trainable network is preferable to multiple, independently-trained components. Our main contributions are the following:

• We introduce, for the first time in APE, a Transformer-based architecture (Vaswani et al., 2017) that considerably reduces system complexity (thus being efficient and easy to train and maintain);

• In doing so, we modify the Transformer architecture to incorporate multiple encoders, thereby also considering source-side information to increase correction accuracy;

• On shared data sets, we report evaluation results that are comparable (less than 0.5 BLEU score points in the worst case) to those of computationally-intensive state-of-the-art systems based on model ensembling and hypothesis reranking.
2 Methodology

In this section we briefly overview our approach, first motivating the use of the Transformer (Vaswani et al., 2017) and then introducing our modifications to deploy it for APE.

Most of the competitive neural approaches in machine translation employ deep recurrent networks (Sutskever et al., 2014; Bahdanau et al., 2015). These approaches follow the encoder-decoder architecture. A sequence of words [x_1, x_2, .., x_n] is given to an encoder, which maps it to a sequence of continuous representations, i.e. the hidden states of the encoder. At each time step, based on these continuous representations and the word generated in the previous time step, a decoder generates the next word. This process continues until the decoder generates the end-of-sentence word. More formally, the decoder predicts the next word y_t, given the context vector c and the previously predicted words y_1 to y_{t-1}, by defining a probability over the translation y as follows:

p(y) = \prod_{t=1}^{T} p(y_t \mid [y_1, .., y_{t-1}], c)    (1)

The context vector c is a weighted sum computed over the hidden states of the encoder. The weights used to compute the context vector are obtained by a network called the attention model, which finds an alignment between the target and source words (Bahdanau et al., 2015).

From an efficiency standpoint, a major drawback of these approaches is that, at each time step, the decoder needs the hidden state of the previous time step, thus hindering parallelization. Other approaches have been proposed to avoid this sequential dependency (e.g. using convolution as the main building block) and make parallelization possible (Gehring et al., 2017; Kalchbrenner et al., 2016). Although they can avoid the recurrence, they are not able to properly learn the long-term dependencies between words.

The Transformer architecture, introduced in (Vaswani et al., 2017), set a new state of the art in NMT by completely avoiding both recurrence and convolution. Since the model does not leverage the order of words, it adds positional encodings to the word embeddings to enable the model to capture word order. In the Transformer, the attention employed is a multi-headed self-attention, which is a mapping from (query, key, value) tuples to an output vector. The self-attention is defined as follows:

SA(Q, K, V) = softmax(Q K^T / \sqrt{d_k}) V    (2)

where Q is the query matrix, K is the key matrix, V is the value matrix, d_k is the dimensionality of the queries and keys, and SA is the computed self-attention.

The multi-head attention is computed as follows:

MH(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)

where MH is the multi-head attention, h is the number of attention layers (also called "heads"), head_i is the self-attention computed over the i-th attention layer, and W^O is a parameter matrix of dimension h·d_v × d_model.

The encoder layers consist of a multi-head self-attention, followed by a position-wise feed-forward network. In the self-attention, the query, key and value matrices come from the previous layer. In the decoder, the layers have an extra encoder-decoder multi-head attention after the multi-head self-attention, where the key and value matrices come from the encoder and the query matrix comes from the previous layer in the decoder. Also, the inputs to the multi-head self-attention in the decoder are masked in order not to attend to the next positions. Finally, a softmax normalization is applied to the output of the last layer in the decoder to generate a probability distribution over the target vocabulary.
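To make Equations (2) and (3) concrete, the following minimal NumPy sketch computes scaled dot-product self-attention and multi-head attention over a toy input. It illustrates the formulas only and is not the authors' implementation; the projection matrices, the head-splitting scheme and the toy sizes (d_model = 512 and 4 heads, matching Section 3.4) are assumptions made here.

```python
# Minimal NumPy sketch of Equations (2) and (3). Illustration only, not the
# authors' implementation: the projection matrices, the head-splitting scheme
# and the toy sizes are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Eq. (2): SA(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Eq. (3): attend in h separate subspaces, concatenate, project by W^O."""
    d_model = Q.shape[-1]
    d_head = d_model // h
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        # per-head projections of queries, keys and values
        heads.append(self_attention(Q @ W_q[:, s], K @ W_k[:, s], V @ W_v[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1, .., head_h) W^O

# Toy usage: 5 positions, d_model = 512 and 4 heads (the values of Section 3.4).
rng = np.random.default_rng(0)
d_model, h, n = 512, 4, 5
x = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (0.02 * rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 512)
```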
In order to encode the source sentence in addition to the MT output, we employ the multi-source method (Zoph and Knight, 2016), wherein the model comprises separate encoders (with different sets of parameters) to capture the source sentence and the MT output respectively. For the Transformer, we concatenate the two encoder outputs and pass the result as the key in the attention. This yields a richer representation, in turn leading to more effective attention at decoding time.
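The sketch below illustrates this multi-source strategy under simplifying assumptions: the encode() placeholder stands in for a full Transformer encoder stack, the shapes are arbitrary, and the same concatenation is reused as the value matrix, which the paper does not specify.

```python
# Sketch of the multi-source attention described above: the two encoder
# outputs are concatenated and the decoder attends over the joint sequence.
# The stand-in encode() function, the toy shapes and the reuse of the same
# concatenation as the value matrix are assumptions made for illustration.
import numpy as np

def attention(Q, K, V):
    # Eq. (2): softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def encode(embeddings):
    # Placeholder for a full Transformer encoder stack (self-attention plus
    # position-wise feed-forward layers, as described in Section 2).
    return embeddings

d_model = 512
rng = np.random.default_rng(1)
src = rng.standard_normal((7, d_model))  # source-sentence representation (7 tokens)
mt = rng.standard_normal((6, d_model))   # MT-output representation (6 tokens)
dec = rng.standard_normal((4, d_model))  # decoder states for 4 target positions

enc_src = encode(src)   # separate encoder (own parameters) for the source
enc_mt = encode(mt)     # separate encoder (own parameters) for the MT output

memory = np.concatenate([enc_src, enc_mt], axis=0)  # (13, d_model) joint key
context = attention(dec, memory, memory)            # encoder-decoder attention
print(context.shape)  # (4, 512)
```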
3 Experiment Setup

3.1 Data

For the sake of a fair comparison with the best performing system at the WMT 2017 APE shared task (Chatterjee et al., 2017), we use the same WMT training, development and test datasets. The training data consist of three different corpora. One of them is released by the task organizers and contains 23K triplets from the Information Technology domain. The other two are synthetic data created by (Junczys-Dowmunt and Grundkiewicz, 2017); they respectively contain ∼4M and ∼500K English-German triplets generated by a round-trip translation process. Using two phrase-based translation models, German-English and English-German, German monolingual data are first translated into English and the obtained outputs are then translated back into German. The original German monolingual data are considered as post-edits, the English translations as source sentences, and the German back-translations as machine translation outputs. The development set is the one released for the WMT 2017 APE shared task, which contains 1K in-domain triplets. We evaluate our model using the two test sets released for the WMT 2016 and 2017 APE shared tasks, each containing 2K in-domain triplets. Table 1 summarizes the statistics of the datasets. To avoid unknown words and to keep the vocabulary size under control, we apply byte pair encoding (Sennrich et al., 2016) to all the data.

  Dataset           Split          Triplets
  synthetic 4M      train          4,391,180
  synthetic 500K    train          526,368
  in-domain         train          23,000
  in-domain         development    1,000
  in-domain 2016    test           2,000
  in-domain 2017    test           2,000

Table 1: Statistics for the synthetic and in-domain datasets.

3.2 Evaluation Metrics

For evaluation, we use the two official metrics of the WMT APE task: i) TER (Snover et al., 2006), which is based on edit distance, and ii) BLEU, which is the geometric mean of n-gram precision (Papineni et al., 2002). Both are applied to tokenized and true-cased data.

3.3 Terms of Comparison

We compare the performance of our Transformer model with two baselines: i) MT Baseline: the output of a "do-nothing" APE model that leaves all the original MT outputs untouched, and ii) Ens8 + RR: the winning system at the WMT 2017 APE shared task (Chatterjee et al., 2017). It comprises 4 different models based on the RNN architecture:

• SRC PE: a single-source model that exploits only the source sentence to generate post-edits;

• MT PE: a single-source model that exploits only the machine translation output to generate post-edits;

• MT+SRC PE: a multi-source model that exploits both the source sentence and the MT output to generate post-edits;

• MT+SRC PE TSL: another multi-source model, with a task-specific loss function designed to avoid over-correction.

For mixing the context vectors of the two encoders, Ens8 + RR uses a merging layer, which applies a linear transformation over the concatenation of the two context vectors. Chatterjee et al. (2017) compared the performance of these 4 models on the development set and reported that MT+SRC PE outperforms the other models. They also ensembled the two best models of each configuration to leverage all the models in a single decoder. On top of that, they trained a re-ranker (Pal et al., 2017) to re-order the n-best hypotheses generated by this ensemble. To train the re-ranker, they used a set of features mainly based on edit distance: the number of insertions, deletions, substitutions and shifts, and the length ratios between the MT output and the APE hypotheses, as well as the precision and recall of the APE hypotheses. In Section 4, we compare our model with the SRC+MT PE model and the ensembled model plus re-ranker (Ens8+RR). We train these models with the same settings reported in (Chatterjee et al., 2017).

3.4 System Setting

We initially train a generic Transformer model using the ∼4M synthetic data. Then, we fine-tune the resulting model on the union of the ∼500K synthetic data and the in-domain training data (replicated 20 times). Our Transformer model uses word embeddings with 512 dimensions. The decoder and each encoder have 4 attention layers with 512 units, 4 parallel attention heads, and a feed-forward layer with 1,024 dimensions. The network parameters are updated using the Lazy Adam optimizer (Kingma and Ba, 2014), with a mini-batch size of 8,192 tokens for generic training and 2,048 tokens for fine-tuning. The learning rate is varied using a warm-up strategy (Vaswani et al., 2017) with 8,000 warm-up steps. During training, the dropout rate and the label smoothing value are set to 0.1.
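For reference, the warm-up strategy of Vaswani et al. (2017) scales the learning rate as d_model^-0.5 · min(step^-0.5, step · warmup^-1.5). The sketch below plugs in the values reported here (d_model = 512, 8,000 warm-up steps); whether an additional scaling constant is applied in the actual OpenNMT-tf configuration is not stated in the paper, so only the textbook formula is shown.

```python
# Sketch of the warm-up learning-rate schedule of Vaswani et al. (2017), with
# d_model = 512 and 8,000 warm-up steps as reported above. Any extra scaling
# constant used in the actual training configuration is an unknown here.
def warmup_lr(step, d_model=512, warmup_steps=8000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (100, 4000, 8000, 16000, 100000):
    print(f"step {step:>6}: lr = {warmup_lr(step):.6f}")
```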
During decoding, we employ beam search with a beam width of 10. For both the generic training and the fine-tuning steps, we train for 10 epochs and choose the best model checkpoints based on their performance on the development set. For our implementation, we use the OpenNMT-tf toolkit (Klein et al., 2017).

4 Results and Discussion

Table 2 shows the results obtained by the different models on the development set. Together with our simple Transformer model (Transformer), it also reports the performance obtained by averaging the weights of the 4 best model checkpoints (Avg4).

  Systems        TER      BLEU
  Baseline       24.81    62.92
  SRC+MT PE      19.77    70.72
  Ens8 + RR      19.22    71.89
  Transformer    19.17    71.58
  Avg4           18.77    72.04

Table 2: Performance of APE systems on the 2017 development set (En-De).

Our Transformer model performs better than the SRC+MT PE model (-0.6 TER and +0.86 BLEU), showing that using the Transformer architecture instead of an RNN is helpful. Also, our Transformer model outperforms Ens8+RR in terms of TER, with only a small loss in terms of BLEU. This highlights that our simple model can achieve results comparable to the best performing systems while using a less complex architecture. By averaging different Transformer checkpoints, our model outperforms Ens8+RR by -0.45 TER and +0.15 BLEU. This gain confirms the results reported by Popel and Bojar (2018), who showed that averaging the model's checkpoint weights is advantageous. Moreover, we do not lose our simplicity advantage with respect to ensembling, since the checkpoints are chosen within a single training round, which does not require training several models and architectures.
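In practice, checkpoint averaging amounts to an element-wise mean of the saved parameter values, as in the minimal sketch below; the dictionary-of-arrays representation of a checkpoint is an assumption made purely for illustration.

```python
# Sketch of checkpoint weight averaging (the Avg4 model): the parameters of
# the 4 best checkpoints are averaged element-wise. Representing a checkpoint
# as a dict of NumPy arrays is an illustrative assumption; real checkpoints
# are stored in the toolkit's own format.
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of identically structured parameter dictionaries."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

# Toy usage with 4 fake checkpoints of a two-parameter "model".
rng = np.random.default_rng(2)
ckpts = [{"W": rng.standard_normal((3, 3)), "b": rng.standard_normal(3)}
         for _ in range(4)]
avg = average_checkpoints(ckpts)
print(avg["W"].shape, avg["b"].shape)  # (3, 3) (3,)
```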
To confirm our observations on the development set, we also evaluated our model against Ens8+RR on the two test sets. Table 3 shows the results, which confirm the observations made on the development data.

  Systems        Test2016           Test2017
                 TER      BLEU      TER      BLEU
  MT Baseline    24.76    62.11     24.48    62.49
  Ens8 + RR      19.32    70.88     19.60    70.07
  Transformer    19.25    70.70     19.81    69.64
  Avg4           18.79    71.48     19.54    70.09

Table 3: Performance of APE systems on the 2016 and 2017 test sets (En-De).

The averaged model performs best, above both the RNN systems and the single Transformer. It significantly outperforms Ens8+RR on the 2016 test data, while a marginal improvement is obtained on the 2017 test set. To conclude, our results confirm the trend seen in machine translation, where the Transformer outperforms RNN-based systems on different language pairs and datasets using a simpler architecture. Besides this, our extension targeting the inclusion of source-side information sets a new state of the art in APE.

5 Conclusion

We developed and used a multi-source Transformer architecture for neural Automatic Post-editing. In contrast to the current state-of-the-art systems for APE, which are based on RNN architectures that typically comprise multiple components, we used a single model that can be trained in an end-to-end fashion. This solution is particularly suitable for industrial settings, where maintaining different components is costly and inefficient. Our experiments show that our simplest model achieves results comparable to the best RNN systems, while our best one can even perform slightly better. This sets a new state of the art in APE and confirms the superiority of the Transformer in sequence-to-sequence learning tasks.

References

Wilker Aziz, Sheila Castilho, and Lucia Specia. 2012. PET: a tool for post-editing and assessing machine translation. In LREC, pages 3982–3987.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214. Association for Computational Linguistics.

Rajen Chatterjee, Marion Weller, Matteo Negri, and Marco Turchi. 2015. Exploring the planet of the APEs: a comparative study of state-of-the-art methods for MT automatic post-editing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 156–161.

Rajen Chatterjee, M. Amin Farajian, Matteo Negri, Marco Turchi, Ankit Srivastava, and Santanu Pal. 2017. Multi-source neural automatic post-editing: FBK's participation in the WMT 2017 APE shared task. In Proceedings of the Second Conference on Machine Translation, pages 630–638. Association for Computational Linguistics.

Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 123–135. Association for Computational Linguistics.

Pierre Isabelle, Cyril Goutte, and Michel Simard. 2007. Domain adaptation of MT systems through automatic post-editing.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2017. The AMU-UEdin submission to the WMT 2017 shared task on automatic post-editing. In Proceedings of the Second Conference on Machine Translation, pages 639–646. Association for Computational Linguistics.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72. Association for Computational Linguistics.

Santanu Pal, Sudip Kumar Naskar, Mihaela Vela, and Josef van Genabith. 2016. A neural network based approach to automatic post-editing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 281–286.

Santanu Pal, Sudip Kumar Naskar, Mihaela Vela, Qun Liu, and Josef van Genabith. 2017. Neural automatic post-editing using prior alignment and reranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 349–355. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Martin Popel and Ondřej Bojar. 2018. Training tips for the Transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Michel Simard, Cyril Goutte, and Pierre Isabelle. 2007a. Statistical phrase-based post-editing.

Michel Simard, Nicola Ueffing, Pierre Isabelle, and Roland Kuhn. 2007b. Rule-based translation with statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 203–206. Association for Computational Linguistics.
Michel Simard, Pierre Isabelle, George Foster, Cyril Goutte, and Roland Kuhn. 2009. Means and method for automatic post-editing of translations. US Patent App. 12/448,859, December 31.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, Massachusetts, USA.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. arXiv preprint arXiv:1601.00710.