Online Neural Automatic Post-editing for Neural Machine Translation

Matteo Negri1, Marco Turchi1, Nicola Bertoldi1,2, Marcello Federico1,2
1 Fondazione Bruno Kessler - Trento, Italy
2 MMT Srl - Trento, Italy
[negri,turchi,bertoldi,federico]@fbk.eu

Abstract

Machine learning from user corrections is key to the industrial deployment of machine translation (MT). We introduce the first on-line approach to automatic post-editing (APE), i.e. the task of automatically correcting MT errors. We present experimental results of APE on English-Italian MT by simulating human post-edits with human reference translations, and by applying online APE on MT outputs of increasing quality. By evaluating APE on generic vs. specialised and static vs. adaptive neural MT, we address the question: At what cost on the MT side will APE become useless?

1 Introduction

Automatic Post-editing for MT is a supervised learning task aimed at correcting errors in a machine-translated text (Knight and Chander, 1994; Simard et al., 2007). Cast as a problem of "monolingual translation" (from raw MT output into improved text in the same target language), APE has followed a similar evolution to that of MT. As in MT, APE research received a strong boost from shared evaluation exercises like those organized within the well-established WMT Conference on Machine Translation (Chatterjee et al., 2018). In terms of approaches, early MT-like phrase-based solutions (Béchara et al., 2011; Rosa et al., 2013; Lagarda et al., 2015; Chatterjee et al., 2015) have been recently outperformed and replaced by neural architectures that now represent the state of the art (Junczys-Dowmunt and Grundkiewicz, 2016; Chatterjee et al., 2017a; Tebbifakhr et al., 2018; Junczys-Dowmunt and Grundkiewicz, 2018). From the industry standpoint, APE has started to attract MT market players interested in combining the two technologies to support human translation in professional workflows (Crego et al., 2016).

Focusing on this industry-oriented perspective, this paper takes a further step in APE research by exploring an online neural approach to the task. The goal is to leverage human feedback (post-edits) to improve a neural APE model on-the-fly, without the need of stopping it for fine-tuning or re-training from scratch. Online learning capabilities are crucial (both for APE and MT) in computer-assisted translation scenarios where professional translators operate on suggestions provided by machines. In such scenarios, human corrections represent an invaluable source of knowledge that systems should exploit to enhance users' experience and increase their productivity.

Towards these objectives we provide three contributions. One is the first online approach to neural APE. Indeed, while MT-like online learning techniques have been proposed for phrase-based APE (Ortiz-Martínez and Casacuberta, 2014; Simard and Foster, 2013; Chatterjee et al., 2017b), nothing has been done yet under the state-of-the-art neural paradigm. In doing this, the second contribution is the first evaluation of neural APE run on the output of neural MT (NMT). So far, published results report significant gains (up to 7.6 BLEU points at WMT 2017 (Bojar et al., 2017)) when APE is run to correct the output of a phrase-based MT system. To our knowledge, the true potential of APE with higher-quality NMT output has not been investigated yet.

The last observation introduces a more general discussion on the relation between MT and APE. Since, by definition, APE's reason for being is the sub-optimal quality of MT output, one might wonder if the level of current MT technology still justifies efforts on APE. Along this direction, our third contribution is an analysis of online neural APE applied to the output of NMT systems featuring different levels of performance. Our competitors range from a generic model trained on large parallel data (mimicking the typical scenario in which industry users – e.g. Language Service Providers – rely on web-based services or other black-box systems) to highly customized online models (like those that LSPs would desire but typically cannot afford).
Our experiments in this range of conditions aim to shed light on the future of APE from the industry standpoint by answering the question: At what cost on the MT side will APE become useless?

2 Online neural APE

APE training data usually consist of (src, mt, hpe) triplets whose elements are: a source sentence (src), its translation (mt) and a human correction of the translated sentence (hpe). Models trained on such triplets are then used to correct the mt element of (src, mt) test data. Neural approaches to the task have shown their effectiveness in batch conditions, in which a static pre-trained model is run on the whole test corpus. When moving to an online setting, instead, APE systems should ideally be able to continuously evolve by stepwise learning from the interaction with the user. This means that, each time a new post-edit becomes available, the model has to update its parameters on-the-fly in order to produce better output for the next incoming sentence.

To this aim, we extend a batch APE model by adding the capability to continuously learn from human corrections of its own output. This is done in two steps:

(1) Before post-editing, by means of an instance-selection mechanism that updates the model by learning from previously collected triplets that are similar to the input test item (see lines 2-5 in Algorithm 1);

(2) After post-editing, by means of a model-adaptation procedure that learns from human revisions of the last automatic correction generated by the system (lines 8-10).

Similar to the methods proposed in (Chatterjee et al., 2017b) and (Farajian et al., 2017), the instance-selection technique (first update step) consists of two components: i) a knowledge base (KB) that is continuously fed with the processed triplets, and ii) an information retrieval engine that, given the (src, mt) test item, selects the most similar triplet (lines 2-3). The engine is simultaneously queried using both the src and mt segments and it returns the triplet that has the highest cosine similarity with both (Top(R)). If the similarity is above a threshold τ, a few training iterations are run to update the model parameters (line 5). Depending on the application scenario, the KB can be pre-filled with the APE training data or left empty and filled only with the incoming triplets. In our experiments, the repository is initially empty.

Algorithm 1: Online neural APE
Require M: Trained APE model
Require Ts: Stream of test data
Require KB: Pool of (src, mt, hpe) triplets
 1: while pop (src, mt) from Ts do
 2:   R ← Retrieve((src, mt), KB)
 3:   (src_top, mt_top, hpe_top) ← Top(R)
 4:   if Sim((src_top, mt_top, hpe_top), (src, mt)) > τ do
 5:     M* ← Update(M, (src_top, mt_top, hpe_top))
 6:   ape ← APE(M*, (src, mt))
 7:   hpe ← HumanPostEdit((src, ape))
 8:   KB ← KB ∪ (src, mt, hpe)
 9:   M** ← Update(M*, (src, mt, hpe))
10:   M ← M**
11: end while

Once the hpe has been generated, the second update step takes place (line 9) by running a few training iterations on the resulting (src, mt, hpe) triplet. When training on one single data point, the learning rate and the number of epochs have a crucial role, because too high/low values can make the training unstable/inefficient. To avoid such problems, we connect the two parameters by applying a time-based decay that reduces the learning rate as the number of epochs increases (i.e. lr = lr/(1 + num_epoch)). In our experiments, this strategy results in better performance than setting a fixed learning rate.
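The loop of Algorithm 1, together with the time-decay learning-rate rule, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `update`, `translate` and `post_edit` are hypothetical stand-ins for a few gradient steps on one triplet, the neural APE decoder, and the human in the loop; averaging the src- and mt-side cosines is one possible reading of "highest cosine similarity with both".

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two whitespace-tokenised strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def top_triplet(kb, src, mt):
    """Top(R): the stored (src, mt, hpe) triplet scoring highest against
    the test item, scored jointly on the src and mt sides (averaged)."""
    scored = [((s, m, h), (cosine(s, src) + cosine(m, mt)) / 2)
              for s, m, h in kb]
    return max(scored, key=lambda x: x[1]) if scored else (None, 0.0)

def decayed_lr(lr0, epoch):
    """Time-based decay tying the learning rate to the epoch count."""
    return lr0 / (1 + epoch)

def online_ape(test_stream, model, update, translate, post_edit, tau=0.5):
    """Online loop of Algorithm 1 over a stream of (src, mt) items."""
    kb = []                                    # knowledge base, initially empty
    for src, mt in test_stream:
        best, sim = top_triplet(kb, src, mt)
        if best is not None and sim > tau:     # first update step (lines 2-5)
            model = update(model, best)
        ape = translate(model, src, mt)        # line 6
        hpe = post_edit(src, ape)              # line 7 (simulated by references)
        kb.append((src, mt, hpe))              # line 8
        model = update(model, (src, mt, hpe))  # second update step (lines 9-10)
        yield ape
```

Note that the knowledge base grows with every processed sentence, so the first update step only starts firing once a sufficiently similar triplet has been collected.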
3 Experiments

We run our experiments on English-Italian data, by comparing the performance of different neural APE models (batch and online) used to correct the output of NMT systems of increasing quality.

3.1 Data

To train our NMT models we use both generic and in-domain data. Generic data cover a variety of domains. They comprise about 53M parallel sentences collected from publicly-available collections (i.e. all the English-Italian parallel corpora available on OPUS, http://opus.lingfil.uu.se, dump of mid June 2017) and about 50M sentence pairs from proprietary translation memories. Generic data, whose size is per se sufficient to train a competitive general-purpose engine, are used to build our basic NMT model. On top of it, in-domain (information technology) data are used in different ways to obtain improved, domain-adapted models. In-domain data are selected to emulate the online setting of industrial scenarios where input documents are processed sequentially on a sentence-by-sentence basis. They consist of a proprietary translation project of about 421K segments, which are split into training (416K segments) and test (5,472 segments) sets keeping the sentence order. Post-edits are simulated using reference translations.

To train the APE models we use the English-Italian section of the eSCAPE corpus (Negri et al., 2018). It consists of about 6.6M synthetically-created triplets in which the mt element is produced with phrase-based and neural MT systems.
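The data setup above (an ordered stream of in-domain segments, with references standing in for human post-edits) can be sketched as follows; `mt_system` is a hypothetical callable from source to MT output, not part of the paper's pipeline.

```python
def make_streams(segments, mt_system, n_train):
    """Split an ordered list of (src, ref) pairs into training and test
    portions while preserving sentence order, and simulate the human
    post-edit (hpe) of each test sentence with its reference translation.
    `mt_system` (src -> mt) is a hypothetical stand-in for the NMT engine."""
    train, test = segments[:n_train], segments[n_train:]
    train_triplets = [(src, mt_system(src), ref) for src, ref in train]
    test_stream = [(src, mt_system(src)) for src, ref in test]
    simulated_hpe = {src: ref for src, ref in test}   # hpe := reference
    return train_triplets, test_stream, simulated_hpe
```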
3.2 NMT models

Our NMT models feature increasing levels of complexity, so as to represent a range of conditions in which a user (say, a Language Service Provider) has access to different resources in terms of MT technology and/or data for training and adaptation. Our systems, ranked in terms of complexity with respect to these two dimensions, are:

Generic (G). This model is trained on the large (103M) multi-domain parallel corpus. It represents the situation in which our LSP entirely relies on an off-the-shelf, black-box MT engine that cannot be improved via domain adaptation.

Generic Online (GO). This model extends G with the capability to learn from the incoming human post-edits (5,472 test items). Before and after translation, a few training iterations adapt it to the domain of the input document. The adaptation steps implement the same strategies of the online APE system (see §2). This setting represents the situation in which our LSP has access to the inner workings of a competitive online NMT system.

Specialized (S). This model is built by fine-tuning (Luong and Manning, 2015) G on the in-domain training data (416K). It reflects the condition in which our LSP has access both to customer's data and to the inner workings of a competitive batch NMT engine. The adaptation routine, however, is limited to the standard approach of performing additional training steps on the in-domain data.

Specialized Online (SO). This model is built by combining the functionalities of GO and S. It uses the in-domain training data for fine-tuning and the incoming (src, hpe) pairs for online adaptation to the target domain. This setting represents the situation in which our LSP has access to: i) customer's in-domain data and ii) the inner workings of a competitive online NMT engine.

All the models are trained with the ModernMT open-source software (http://github.com/ModernMT/MMT), which is built on top of OpenNMT-py (Klein et al., 2017). It employs an LSTM-based recurrent architecture with attention (Bahdanau et al., 2014) using 2 bi-directional LSTM layers in the encoder, 4 left-to-right LSTM layers in the decoder, and a dot-product attention model (Luong et al., 2015). In our experiments we used an embedding size of 1024, LSTMs of size 1024, and a source and target vocabulary of 32K words, jointly trained with the BPE algorithm (Sennrich et al., 2016).
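The four NMT conditions differ along just two axes: whether the generic model is fine-tuned on the 416K in-domain segments, and whether it adapts online to incoming post-edits. This can be made explicit as a small enumeration (the field names are ours, introduced for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NMTCondition:
    """One of the four NMT conditions of Section 3.2."""
    name: str
    fine_tuned: bool   # fine-tuned on the 416K in-domain segments (S, SO)
    online: bool       # adapts on-the-fly to incoming post-edits (GO, SO)

CONDITIONS = [
    NMTCondition("Generic (G)",             fine_tuned=False, online=False),
    NMTCondition("Generic Online (GO)",     fine_tuned=False, online=True),
    NMTCondition("Specialized (S)",         fine_tuned=True,  online=False),
    NMTCondition("Specialized Online (SO)", fine_tuned=True,  online=True),
]
```

Reading the conditions this way makes clear that SO is the upper bound along both dimensions, which is why it serves as the hardest baseline for APE in Section 4.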
The fact that ModernMT already implements the online adaptation method presented in (Farajian et al., 2017) simplified our tests with online neural APE run on the output of competitive NMT systems (GO and SO).

3.3 APE models

We experiment with two neural APE systems:

Generic APE. This batch system is trained only on generic data (6.6M triplets from eSCAPE) and is similar to those tested in the APE shared task at WMT. The main difference is that the training data are neither merged with in-domain triplets nor selected based on target-domain information.

Online APE. This system is trained on the generic data and continuously learns from human post-edits of the test set as described in §2.

The two systems are based on a multi-source attention-based encoder-decoder approach similar to (Chatterjee et al., 2017a). It employs a GRU-based recurrent architecture with attention and uses two independent encoders to process the src and mt segments. Similar to the NMT systems, it is trained on sub-word units by using BPE, with a vocabulary created by selecting the 50K most frequent sub-words. Word embedding and GRU hidden state sizes are set to 1024. Network parameters are optimized with Adagrad (Duchi et al., 2011) with a learning rate of 0.01. A development set randomly extracted from the training data is used to set the similarity threshold used by the online model for the first update step (τ = 0.5), as well as the learning rate (0.01) and the number of epochs (3) of both adaptation steps.
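Tuning τ (and, analogously, the learning rate and epoch count) on the development set amounts to a small grid search. A minimal sketch, where `evaluate` is a hypothetical callable that runs the online system once with a given threshold and returns a dev-set quality score such as BLEU:

```python
def tune_threshold(candidates, evaluate):
    """Grid search: return the similarity threshold that maximises the
    dev-set score, together with all scores for inspection.
    `evaluate` (tau -> score) is a hypothetical stand-in for one full
    run of the online APE system on the development set."""
    scores = {tau: evaluate(tau) for tau in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```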
4 Results and discussion

APE results computed on different levels of translation quality are reported in Table 1.

MT Type              MT     Generic APE   Online APE
Generic (G)          40.3   39.0          47.1†
Gen. Online (GO)     45.6   41.9          48.1†
Specialized (S)      52.1   45.5          53.5†
Spec. Online (SO)    55.0   47.4          54.8

Table 1: APE performance (BLEU) on NMT outputs of different quality ("†" denotes statistically significant differences wrt. the MT baseline with p<0.05; statistical significance is computed with paired bootstrap resampling (Koehn, 2004)).

Looking at the NMT performance, all the adaptation techniques yield significant improvements over the Generic model (G). The large gain achieved via fine-tuning on in-domain data (S: +11.8 BLEU) is further increased when adding online learning capabilities on top of it to create the most competitive Specialized Online system (SO: +14.7).

As expected, the batch APE model trained on generic data only (that is, without in-domain information) is unable to improve the quality of raw MT output. Moreover, although APE results increase with higher translation quality, the performance distance from the more competitive NMT systems also becomes larger (from -1.3 to -7.6 points respectively for G and SO). These results confirm the WMT findings about the importance of domain customization for batch APE (Bojar et al., 2017), and advocate for online solutions capable of maximizing knowledge exploitation at test time by learning from user feedback.

Online APE achieves significant improvements not only over the output of G (+6.8) and its online extension GO (+2.5), but also over the specialized model S (+1.4). The gain over GO is particularly interesting: it shows that even when APE and MT use the same in-domain data for online adaptation, the APE model is more reactive to human feedback. Though trained on much smaller generic corpora (6.6M triplets versus 103M parallel sentences), the possibility to leverage richer information in the form of (src, mt, hpe) instances at test time seems to have a positive impact. A deeper exploration of this aspect falls outside the scope of this paper and is left as future work.

Also with online APE, the gains become smaller as MT quality increases, reaching a point where the system can only approach the highest MT performance of SO (with a non-significant -0.2 BLEU difference). This confirms that correcting the output of competitive NMT engines is a hard task, even for a dynamic APE system that learns from the interaction with the user. However, besides improving its performance by learning from user feedback acquired at test time (similar to the APE system), SO also relies on previous fine-tuning on a large in-domain corpus (similar to S). To answer our initial question ("At what cost on the MT side will APE become useless?") it is worth remarking that leveraging in-domain training/adaptation data is a considerable advantage for MT, but it comes at a cost that should not be underestimated. In terms of the data itself, collecting enough parallel sentences for each target domain is a considerable bottleneck that limits the scalability of competitive NMT solutions. In addition to that, the technology requirements (i.e. having access to the inner workings of the NMT engine) and the computational costs involved (for fine-tuning the generic model) are constraints that few LSPs are probably able to satisfy.
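The significance test used throughout (paired bootstrap resampling, Koehn 2004) can be sketched as follows. For simplicity this sketch compares sums of per-sentence scores on each resample, whereas the paper's corpus-level BLEU would be recomputed on each resampled set; the function names are ours.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples of the test set (drawn with
    replacement) on which system A scores strictly higher than B.
    scores_a / scores_b are per-sentence quality scores aligned on the
    same test sentences; values close to 1.0 indicate a significant
    improvement of A over B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # one resampled test set
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

For example, a win rate above 0.95 over 1000 resamples corresponds to the p<0.05 threshold marked with "†" in Table 1.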
5 Conclusion

We introduced an online neural APE system, which is trained on generic data and only exploits user feedback to improve its performance, and evaluated it on the output of NMT systems featuring increasing complexity and in-domain data demand. Our results show the effectiveness of current APE technology in the typical setting of most LSPs while, in terms of the resources (especially in-domain data) and technical expertise needed, we also conclude that developing MT engines that make APE useless is still a prerogative of few.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

Hanna Béchara, Yanjun Ma, and Josef van Genabith. 2011. Statistical Post-Editing for a Statistical MT System. In Proceedings of the 13th Machine Translation Summit, pages 308-315, Xiamen, China, September.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 169-214, Copenhagen, Denmark, September.

Rajen Chatterjee, Marion Weller, Matteo Negri, and Marco Turchi. 2015. Exploring the Planet of the APEs: a Comparative Study of State-of-the-art Methods for MT Automatic Post-Editing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 156-161, Beijing, China, July.

Rajen Chatterjee, M. Amin Farajian, Matteo Negri, Marco Turchi, Ankit Srivastava, and Santanu Pal. 2017a. Multi-source Neural Automatic Post-Editing: FBK's participation in the WMT 2017 APE shared task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 630-638, Copenhagen, Denmark, September.

Rajen Chatterjee, Gebremedhen Gebremelak, Matteo Negri, and Marco Turchi. 2017b. Online Automatic Post-editing for MT in a Multi-Domain Translation Environment. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 525-535, Valencia, Spain, April.

Rajen Chatterjee, Matteo Negri, Raphael Rubino, and Marco Turchi. 2018. Findings of the WMT 2018 Shared Task on Automatic Post-Editing. In Proceedings of the Third Conference on Machine Translation, Brussels, Belgium, October.

Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, et al. 2016. SYSTRAN's Pure Neural Machine Translation Systems. arXiv preprint arXiv:1610.05540.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121-2159, July.

M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-Domain Neural Machine Translation through Unsupervised Adaptation. In Proceedings of the Second Conference on Machine Translation, pages 127-137, Copenhagen, Denmark, September.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751-758, Berlin, Germany, August.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2018. Microsoft and University of Edinburgh at WMT2018: Dual-Source Transformer for Automatic Post-Editing. In Proceedings of the Third Conference on Machine Translation, Brussels, Belgium, October.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL 2017, System Demonstrations, pages 67-72, July.

Kevin Knight and Ishwar Chander. 1994. Automated Post-Editing of Documents. In Proceedings of AAAI, volume 94, pages 779-784.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 388-395, Barcelona, Spain, July.

Antonio L. Lagarda, Daniel Ortiz-Martínez, Vicent Alabau, and Francisco Casacuberta. 2015. Translating without In-domain Corpus: Machine Translation Post-Editing with Online Learning Techniques. Computer Speech & Language, 32(1):109-134.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'15), pages 76-79, Da Nang, Vietnam, December.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025.

Matteo Negri, Marco Turchi, Rajen Chatterjee, and Nicola Bertoldi. 2018. eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May.

Daniel Ortiz-Martínez and Francisco Casacuberta. 2014. The New THOT Toolkit for Fully-Automatic and Interactive Statistical Machine Translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 45-48, Gothenburg, Sweden, April.

Rudolf Rosa, David Mareček, and Aleš Tamchyna. 2013. Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 172-179, Sofia, Bulgaria, August.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany, August.

Michel Simard and George Foster. 2013. PEPr: Post-edit Propagation Using Phrase-based Statistical Machine Translation. In Proceedings of the XIV Machine Translation Summit, pages 191-198, Nice, France, September.

Michel Simard, Cyril Goutte, and Pierre Isabelle. 2007. Statistical Phrase-Based Post-Editing. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 508-515, Rochester, New York, April.

Amirhossein Tebbifakhr, Ruchit Agrawal, Rajen Chatterjee, Matteo Negri, and Marco Turchi. 2018. Multi-source Transformer with Combined Losses for Automatic Post-Editing. In Proceedings of the Third Conference on Machine Translation, Brussels, Belgium, October.