Multi-source Transformer for Automatic Post-Editing

Amirhossein Tebbifakhr1,2, Ruchit Agrawal1,2, Matteo Negri1, Marco Turchi1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento - Italy
2 University of Trento, Italy
{atebbifakhr,ragrawal,negri,turchi}@fbk.eu

Abstract

English. Recent approaches to the Automatic Post-editing (APE) of Machine Translation (MT) have shown that the best results are obtained by neural multi-source models that correct the raw MT output by also considering information from the corresponding source sentence. In this paper, we pursue this objective by exploiting, for the first time in APE, the Transformer architecture. Our approach is much simpler than the best current solutions, which are based on ensembling multiple models and adding a final hypothesis re-ranking step. We evaluate our Transformer-based system on the English-German data released for the WMT 2017 APE shared task, achieving results that outperform the state of the art with a simpler architecture suitable for industrial applications.

Italiano. The most effective approaches to the automatic correction of errors in machine translation (Automatic Post-editing, APE) are currently based on multi-source neural models, i.e. models able to exploit information coming both from the sentence to be corrected and from the sentence in the source language. Following this approach, in this paper we apply the Transformer architecture for the first time, obtaining a system that is considerably less complex than those proposed so far (the best of which are based on the combination of multiple models). Through experiments on the English-German data released for the APE task at WMT 2017, we show that, in addition to this gain in simplicity, the proposed method achieves results above the state of the art.

1 Introduction

Automatic post-editing (APE) (Simard et al., 2007b; Simard et al., 2007a; Simard et al., 2009) is the task of fixing errors in a machine-translated text by learning from human corrections. It has been shown to be useful for various tasks like domain adaptation (Isabelle et al., 2007) and for reducing the time, effort and overall costs of human translation in industry environments (Aziz et al., 2012).

Recent approaches to the task have shown that better results can be obtained by neural multi-source models that perform the automatic correction of raw MT output by also considering information from the corresponding source sentence (Chatterjee et al., 2015; Pal et al., 2016). However, state-of-the-art APE solutions employ pipelined architectures (Bojar et al., 2017) whose complexity reduces their usability in industrial settings. Indeed, current top systems typically rely on ensembling multiple recurrent neural networks (RNNs) and performing a final re-ranking step (Chatterjee et al., 2017) to select the most promising correction hypothesis. Though competitive, such architectures require training and maintaining multiple components, involving costs that reduce their appeal from the industry perspective.

In this paper, we address this issue, aiming at a method that is suitable for industry applications, in which a single trainable network is preferable to multiple, independently-trained components. Our main contributions are the following:

• We introduce, for the first time in APE, a Transformer-based architecture (Vaswani et al., 2017) that considerably reduces system complexity (thus being efficient and easy to train and maintain);

• In doing so, we modify the Transformer architecture to incorporate multiple encoders, thereby also considering source-side information to increase correction accuracy;

• On shared data sets, we report evaluation results that are comparable (less than 0.5 BLEU score points in the worst case) to those of computationally-intensive state-of-the-art systems based on model ensembling and hypothesis reranking.
2 Methodology

In this section we briefly overview our approach, first motivating the use of the Transformer (Vaswani et al., 2017) and then introducing our modifications to deploy it for APE.

Most of the competitive neural approaches in machine translation employ deep recurrent networks (Sutskever et al., 2014; Bahdanau et al., 2015). These approaches follow the encoder-decoder architecture. A sequence of words [x_1, x_2, .., x_n] is given to an encoder, which maps it to a sequence of continuous representations, i.e. the hidden states of the encoder. At each time step, based on these continuous representations and the word generated in the previous time step, a decoder generates the next word. This process continues until the decoder generates the end-of-sentence word. More formally, the decoder predicts the next word y_t, given the context vector c and the previously predicted words y_1 to y_{t-1}, by defining a probability over the translation y as follows:

p(y) = \prod_{t=1}^{T} p(y_t \mid [y_1, .., y_{t-1}], c)    (1)

The context vector c is a weighted sum computed over the hidden states of the encoder. The weights used to compute the context vector are obtained by a network called the attention model, which finds an alignment between the target and source words (Bahdanau et al., 2015).

From an efficiency standpoint, a major drawback of these approaches is that, at each time step, the decoder needs the hidden state of the previous time step, thus hindering parallelization. Other approaches have been proposed to avoid this sequential dependency (e.g. using convolution as the main building block) and make parallelization possible (Gehring et al., 2017; Kalchbrenner et al., 2016). Although they can avoid the recurrence, they are not able to properly learn the long-term dependencies between words.

The Transformer architecture, introduced in (Vaswani et al., 2017), set a new state of the art in NMT by completely avoiding both recurrence and convolution. Since the model does not leverage the order of words, it adds positional encodings to the word embeddings to enable the model to capture word order. In the Transformer, the attention employed is a multi-headed self-attention, which is a mapping from (query, key, value) tuples to an output vector. The self-attention is defined as follows:

SA(Q, K, V) = softmax(Q K^T / \sqrt{d_k}) V    (2)

where Q is the query matrix, K is the key matrix, V is the value matrix, d_k is the dimensionality of the queries and keys, and SA is the computed self-attention.

The multi-head attention is computed as follows:

MH(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)

where MH is the multi-head attention, h is the number of attention layers (also called "heads"), head_i is the self-attention computed over the i-th attention layer, and W^O is a parameter matrix of dimension h·d_v × d_model.

The encoder layers consist of a multi-head self-attention, followed by a position-wise feed-forward network. In the self-attention, the query, key and value matrices come from the previous layer. In the decoder, the layers have an extra encoder-decoder multi-head attention after the multi-head self-attention, where the key and value matrices come from the encoder and the query matrix comes from the previous layer in the decoder. Also, the inputs to the multi-head self-attention in the decoder are masked in order not to attend to the next positions. Finally, a softmax normalization is applied to the output of the last layer in the decoder to generate a probability distribution over the target vocabulary.
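To make Equations (2) and (3) concrete, the following minimal NumPy sketch computes scaled dot-product self-attention and multi-head attention over a toy input. It illustrates the formulas only and is not the authors' implementation; the projection matrices, the head-splitting scheme and the toy sizes (d_model = 512 and 4 heads, matching Section 3.4) are assumptions made here.

```python
# Minimal NumPy sketch of Equations (2) and (3). Illustration only, not the
# authors' implementation: the projection matrices, the head-splitting scheme
# and the toy sizes are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Eq. (2): SA(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Eq. (3): attend in h separate subspaces, concatenate, project by W^O."""
    d_model = Q.shape[-1]
    d_head = d_model // h
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        # per-head projections of queries, keys and values
        heads.append(self_attention(Q @ W_q[:, s], K @ W_k[:, s], V @ W_v[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1, .., head_h) W^O

# Toy usage: 5 positions, d_model = 512 and 4 heads (the values of Section 3.4).
rng = np.random.default_rng(0)
d_model, h, n = 512, 4, 5
x = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (0.02 * rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 512)
```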
In order to encode the source sentence in addition to the MT output, we employ the multi-source method (Zoph and Knight, 2016), wherein the model comprises separate encoders (with different sets of parameters) to capture the source sentence and the MT output respectively. For the Transformer, we concatenate the two encoder outputs and pass the result as the key in the attention. This yields a richer representation, in turn leading to more effective attention at decoding time.
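The sketch below illustrates this multi-source strategy under simplifying assumptions: the encode() placeholder stands in for a full Transformer encoder stack, the shapes are arbitrary, and the same concatenation is reused as the value matrix, which the paper does not specify.

```python
# Sketch of the multi-source attention described above: the two encoder
# outputs are concatenated and the decoder attends over the joint sequence.
# The stand-in encode() function, the toy shapes and the reuse of the same
# concatenation as the value matrix are assumptions made for illustration.
import numpy as np

def attention(Q, K, V):
    # Eq. (2): softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def encode(embeddings):
    # Placeholder for a full Transformer encoder stack (self-attention plus
    # position-wise feed-forward layers, as described in Section 2).
    return embeddings

d_model = 512
rng = np.random.default_rng(1)
src = rng.standard_normal((7, d_model))  # source-sentence representation (7 tokens)
mt = rng.standard_normal((6, d_model))   # MT-output representation (6 tokens)
dec = rng.standard_normal((4, d_model))  # decoder states for 4 target positions

enc_src = encode(src)   # separate encoder (own parameters) for the source
enc_mt = encode(mt)     # separate encoder (own parameters) for the MT output

memory = np.concatenate([enc_src, enc_mt], axis=0)  # (13, d_model) joint key
context = attention(dec, memory, memory)            # encoder-decoder attention
print(context.shape)  # (4, 512)
```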
3 Experiment Setup

3.1 Data

For the sake of a fair comparison with the best performing system at the WMT 2017 APE shared task (Chatterjee et al., 2017), we use the same WMT training, development and test datasets. The training data consist of three different corpora. One of them is released by the task organizers and contains 23K triplets from the Information Technology domain. The other two are synthetic data created by (Junczys-Dowmunt and Grundkiewicz, 2017); they respectively contain ∼4M and ∼500K English-German triplets generated by a round-trip translation process. Using two phrase-based translation models, German-English and English-German, German monolingual data are first translated into English and the obtained outputs are then translated back into German. The original German monolingual data are considered as post-edits, the English translations as source sentences, and the German back-translations as machine translation outputs. The development set is the one released for the WMT 2017 APE shared task, which contains 1K in-domain triplets. We evaluate our model using the two test sets released for the WMT 2016 and 2017 APE shared tasks, each containing 2K in-domain triplets. Table 1 summarizes the statistics of the datasets. To avoid unknown words and to keep the vocabulary size under control, we apply byte pair encoding (Sennrich et al., 2016) to all the data.

  Dataset           Split          Triplets
  synthetic 4M      train          4,391,180
  synthetic 500K    train          526,368
  in-domain         train          23,000
  in-domain         development    1,000
  in-domain 2016    test           2,000
  in-domain 2017    test           2,000

Table 1: Statistics for the synthetic and in-domain datasets.

3.2 Evaluation Metrics

For evaluation, we use the two official metrics of the WMT APE task: i) TER (Snover et al., 2006), which is based on edit distance, and ii) BLEU, which is the geometric mean of n-gram precision (Papineni et al., 2002). Both are applied to tokenized and true-cased data.

3.3 Terms of Comparison

We compare the performance of our Transformer model with two baselines: i) MT Baseline: the output of a "do-nothing" APE model that leaves all the original MT outputs untouched, and ii) Ens8 + RR: the winning system at the WMT 2017 APE shared task (Chatterjee et al., 2017). It comprises 4 different models based on the RNN architecture:

• SRC PE: a single-source model that exploits only the source sentence to generate post-edits;

• MT PE: a single-source model that exploits only the machine translation output to generate post-edits;

• MT+SRC PE: a multi-source model that exploits both the source sentence and the MT output to generate post-edits;

• MT+SRC PE TSL: another multi-source model, with a task-specific loss function designed to avoid over-correction.

For mixing the context vectors of the two encoders, Ens8 + RR uses a merging layer, which applies a linear transformation over the concatenation of the two context vectors. Chatterjee et al. (2017) compared the performance of these 4 models on the development set and reported that MT+SRC PE outperforms the other models. They also ensembled the two best models of each configuration to leverage all the models in a single decoder. On top of that, they trained a re-ranker (Pal et al., 2017) to re-order the n-best hypotheses generated by this ensemble. To train the re-ranker, they used a set of features mainly based on edit distance: the number of insertions, deletions, substitutions and shifts, and the length ratios between the MT output and the APE hypotheses, as well as the precision and recall of the APE hypotheses. In Section 4, we compare our model with the SRC+MT PE model and the ensembled model plus re-ranker (Ens8+RR). We train these models with the same settings reported in (Chatterjee et al., 2017).

3.4 System Setting

We initially train a generic Transformer model using the ∼4M synthetic data. Then, we fine-tune the resulting model on the union of the ∼500K synthetic data and the in-domain training data (replicated 20 times). Our Transformer model uses word embeddings with 512 dimensions. The decoder and each encoder have 4 attention layers with 512 units, 4 parallel attention heads, and a feed-forward layer with 1,024 dimensions. The network parameters are updated using the Lazy Adam optimizer (Kingma and Ba, 2014), with a mini-batch size of 8,192 tokens for generic training and 2,048 tokens for fine-tuning. The learning rate is varied using a warm-up strategy (Vaswani et al., 2017) with 8,000 warm-up steps. During training, the dropout rate and the label smoothing value are set to 0.1.
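For reference, the warm-up strategy of Vaswani et al. (2017) scales the learning rate as d_model^-0.5 · min(step^-0.5, step · warmup^-1.5). The sketch below plugs in the values reported here (d_model = 512, 8,000 warm-up steps); whether an additional scaling constant is applied in the actual OpenNMT-tf configuration is not stated in the paper, so only the textbook formula is shown.

```python
# Sketch of the warm-up learning-rate schedule of Vaswani et al. (2017), with
# d_model = 512 and 8,000 warm-up steps as reported above. Any extra scaling
# constant used in the actual training configuration is an unknown here.
def warmup_lr(step, d_model=512, warmup_steps=8000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (100, 4000, 8000, 16000, 100000):
    print(f"step {step:>6}: lr = {warmup_lr(step):.6f}")
```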
During decoding, we employ beam search with a beam width of 10. For both the generic training and the fine-tuning steps, we train for 10 epochs and choose the best model checkpoints based on their performance on the development set. For our implementation, we use the OpenNMT-tf toolkit (Klein et al., 2017).

4 Results and Discussion

Table 2 shows the results obtained by the different models on the development set. Together with our simple Transformer model (Transformer), it also reports the performance obtained by averaging the weights of the 4 best model checkpoints (Avg4).

  Systems        TER      BLEU
  Baseline       24.81    62.92
  SRC+MT PE      19.77    70.72
  Ens8 + RR      19.22    71.89
  Transformer    19.17    71.58
  Avg4           18.77    72.04

Table 2: Performance of APE systems on the 2017 development set (En-De).

Our Transformer model performs better than the SRC+MT PE model (-0.6 TER and +0.86 BLEU), showing that using the Transformer architecture instead of an RNN is helpful. Also, our Transformer model outperforms Ens8+RR in terms of TER, with only a small loss in terms of BLEU. This highlights that our simple model can achieve results comparable to the best performing systems while using a less complex architecture. By averaging different Transformer checkpoints, our model outperforms Ens8+RR by -0.45 TER and +0.15 BLEU. This gain confirms the results reported by Popel and Bojar (2018), who showed that averaging the model's checkpoint weights is advantageous. Moreover, we do not lose our simplicity advantage with respect to ensembling, since the checkpoints are chosen within a single training round, which does not require training several models and architectures.
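In practice, checkpoint averaging amounts to an element-wise mean of the saved parameter values, as in the minimal sketch below; the dictionary-of-arrays representation of a checkpoint is an assumption made purely for illustration.

```python
# Sketch of checkpoint weight averaging (the Avg4 model): the parameters of
# the 4 best checkpoints are averaged element-wise. Representing a checkpoint
# as a dict of NumPy arrays is an illustrative assumption; real checkpoints
# are stored in the toolkit's own format.
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of identically structured parameter dictionaries."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

# Toy usage with 4 fake checkpoints of a two-parameter "model".
rng = np.random.default_rng(2)
ckpts = [{"W": rng.standard_normal((3, 3)), "b": rng.standard_normal(3)}
         for _ in range(4)]
avg = average_checkpoints(ckpts)
print(avg["W"].shape, avg["b"].shape)  # (3, 3) (3,)
```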
To confirm our observations on the development set, we also evaluated our model against Ens8+RR on the two test sets. Table 3 shows the results, which confirm the observations made on the development data.

  Systems        Test2016           Test2017
                 TER      BLEU      TER      BLEU
  MT Baseline    24.76    62.11     24.48    62.49
  Ens8 + RR      19.32    70.88     19.60    70.07
  Transformer    19.25    70.70     19.81    69.64
  Avg4           18.79    71.48     19.54    70.09

Table 3: Performance of APE systems on the 2016 and 2017 test sets (En-De).

The averaged model performs best, above both the RNN systems and the single Transformer. It significantly outperforms Ens8+RR on the 2016 test data, while a marginal improvement is obtained on the 2017 test set. To conclude, our results confirm the trend seen in machine translation, where the Transformer outperforms RNN-based systems on different language pairs and datasets using a simpler architecture. Besides this, our extension targeting the inclusion of source-side information sets a new state of the art in APE.

5 Conclusion

We developed and used a multi-source Transformer architecture for neural Automatic Post-editing. In contrast to the current state-of-the-art systems for APE, which are based on RNN architectures that typically comprise multiple components, we used a single model that can be trained in an end-to-end fashion. This solution is particularly suitable for industrial settings, where maintaining different components is costly and inefficient. Our experiments show that our simplest model achieves results comparable to the best RNN systems, while our best one can even perform slightly better. This sets a new state of the art in APE and confirms the superiority of the Transformer in sequence-to-sequence learning tasks.

References

Wilker Aziz, Sheila Castilho, and Lucia Specia. 2012. PET: a tool for post-editing and assessing machine translation. In LREC, pages 3982–3987.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214. Association for Computational Linguistics.

Rajen Chatterjee, Marion Weller, Matteo Negri, and Marco Turchi. 2015. Exploring the planet of the APEs: a comparative study of state-of-the-art methods for MT automatic post-editing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 156–161.

Rajen Chatterjee, M. Amin Farajian, Matteo Negri, Marco Turchi, Ankit Srivastava, and Santanu Pal. 2017. Multi-source neural automatic post-editing: FBK's participation in the WMT 2017 APE shared task. In Proceedings of the Second Conference on Machine Translation, pages 630–638. Association for Computational Linguistics.

Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 123–135. Association for Computational Linguistics.

Pierre Isabelle, Cyril Goutte, and Michel Simard. 2007. Domain adaptation of MT systems through automatic post-editing.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2017. The AMU-UEdin submission to the WMT 2017 shared task on automatic post-editing. In Proceedings of the Second Conference on Machine Translation, pages 639–646. Association for Computational Linguistics.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72. Association for Computational Linguistics.

Santanu Pal, Sudip Kumar Naskar, Mihaela Vela, and Josef van Genabith. 2016. A neural network based approach to automatic post-editing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 281–286.

Santanu Pal, Sudip Kumar Naskar, Mihaela Vela, Qun Liu, and Josef van Genabith. 2017. Neural automatic post-editing using prior alignment and reranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 349–355. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Martin Popel and Ondřej Bojar. 2018. Training tips for the Transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Michel Simard, Cyril Goutte, and Pierre Isabelle. 2007a. Statistical phrase-based post-editing.

Michel Simard, Nicola Ueffing, Pierre Isabelle, and Roland Kuhn. 2007b. Rule-based translation with statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 203–206. Association for Computational Linguistics.
Michel Simard, Pierre Isabelle, George Foster, Cyril Goutte, and Roland Kuhn. 2009. Means and method for automatic post-editing of translations. US Patent App. 12/448,859, December 31.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, Cambridge, Massachusetts, USA.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. arXiv preprint arXiv:1601.00710.