     Machine Translation of Covid-19 Information Resources via Multilingual
                                   Transfer

                                                   Ivana Kvapilíková and Ondřej Bojar

                                          Charles University, Faculty of Mathematics and Physics
                                                Institute of Formal and Applied Linguistics
                                          Malostranské náměstí 25, 118 00 Prague, Czech Republic
                                                     @ufal.mff.cuni.cz

Abstract: The Covid-19 pandemic has created a global demand for accurate and up-to-date information which often originates in English and needs to be translated. To train a machine translation system for such a narrow topic, we leverage in-domain training data in other languages, both from related and unrelated language families. We experiment with different transfer learning schedules and observe that transferring via more than one auxiliary language brings the most improvement. We compare the performance with joint multilingual training and report superior results of the transfer learning approach.

1 Introduction

A global crisis such as the current Covid-19 pandemic requires information to be spread as efficiently as possible. Working with data from different international resources in multiple languages can resolve possible inconsistencies and prevent misinformation. In an emergency situation, new data is released constantly and is communicated to the public not only via national news and authorities, but also via foreign media, scientific journals or statements of international agencies. There are extensive data resources written in English which are not accessible to non-English speakers.

In order to quickly access the information in a foreign language, machine translation (MT) can be of great help. However, Covid-related texts use a specific terminology and MT models are known to struggle outside of the general domain.

More than a year after the Covid outbreak, there is already a significant amount of domain-specific multilingual text resources. Furthermore, Covid-related texts are a part of a broader medical domain which can provide additional authentic data for training. Thanks to the MLIA @ Eval¹ initiative, which gathered training data for MT and information retrieval related to the pandemic, we can successfully adapt an MT system to the Covid domain or even train it from scratch.

This paper gives an overview of possible methods to automatically translate Covid-related texts. Section 2 outlines different approaches to train a domain-specific MT system using multilingual corpora. Section 3 describes our training data and Section 4 gives more details about our MT systems and presents the results. Section 5 concludes the paper.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ http://eval.covid19-mlia.eu/

2 Methodology

In this section we outline several strategies applicable in the situation where we need to translate from English to multiple languages, we are confined within a specific domain, and we have mid-size parallel corpora for every language pair of interest.

Firstly, we can train a standard MT system from scratch for each language separately, possibly resorting to some data augmentation method, e.g. back-translation. Secondly, we can use transfer learning to transfer from a pretrained MT system in one language to another. Finally, we can train a multilingual MT system which learns jointly from all available data.

In our experiments, we do not consider any additional monolingual resources. Although monolingual data are generally easier to obtain, we remain constrained by the datasets provided by the organizers of MLIA @ Eval which are described in Section 3. We also do not evaluate a transfer from a large MT model pretrained on texts from the general domain, which would be a promising strategy as well.

2.1 Low-resource Neural Machine Translation

When neural machine translation (NMT) became the dominant paradigm in MT [19], it was believed that extremely large parallel resources are required for training. However, Sennrich and Zhang [18] showed that with careful tuning of the hyperparameters, an NMT model can be successfully trained already on 100k sentence pairs, which is less than we have available in the Covid/medical domain for the language pairs of our interest. Furthermore, Conneau and Lample [4] show that translation quality can be further boosted by pretraining a language model and using it to initialize the parameters of both the encoder and the decoder.

An NMT system directly trained to translate in the Covid domain serves as our baseline.
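To make the pretraining idea of Conneau and Lample [4] concrete, the sketch below shows one way an encoder-decoder translation model could be initialized from a masked language model checkpoint. It is a simplified illustration under our own assumptions: the attribute names `encoder` and `decoder` and the checkpoint layout are hypothetical, and the actual initialization in our experiments is handled internally by the XLM toolkit.

```python
import torch
import torch.nn as nn

def init_from_mlm(model: nn.Module, mlm_checkpoint: str) -> None:
    """Copy pretrained masked-LM weights into both halves of an NMT model.

    `model.encoder` and `model.decoder` are hypothetical attribute names;
    only tensors whose names and shapes match are copied, so e.g. the
    decoder's cross-attention blocks keep their random initialization.
    """
    mlm_state = torch.load(mlm_checkpoint, map_location="cpu")
    for half in (model.encoder, model.decoder):
        own_state = half.state_dict()
        for name, tensor in mlm_state.items():
            if name in own_state and own_state[name].shape == tensor.shape:
                own_state[name] = tensor
        half.load_state_dict(own_state)
```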
                                      de         el           es           fr              it            sv
                     Train         925,647    834,240     1,028,287    1,004,215        900,472       806,425
                     Dev             528       3,378        1,973         728            3,245          723
                     Dev Test        500        500          500          500             500           500
                     Blind Test     2,000      2,000        2,000        2,000           2,000         2,000

                                                   Table 1: Data summary


2.2 Data Augmentation

Back-translation is a crucial method in NMT used to augment training data by translating an existing monolingual corpus [16]. The synthetic text can be either on the source [16] or the target [21] side of the training corpus, or both [15].

When using a bidirectional model (sharing the encoder and decoder for both translation directions), back-translation can be performed on the fly. During training, the model switches between the training and the inference mode to produce batches of synthetic sentence pairs and learns from both authentic and synthetic samples in each training step. As the system improves, the quality of the generated samples improves as well. This approach was originally proposed for training an unsupervised MT system [2, 14].

In our systems, we translate only the target sentences and generate a synthetic source side on the fly. We do not use any additional monolingual data for back-translation.

2.3 Transfer Learning

The first strategy to utilize multilingual in-domain training corpora is transfer learning. It can be used to transfer from a different domain [6] or a different language [12, 22]. In this work we focus on the latter.

A trivial transfer learning approach was proposed by Kocmi and Bojar [12] who fine-tune a low-resource child model from a high-resource parent model pretrained for a different language pair. The training procedure consists of first training an NMT model on the parent parallel corpus until it converges and then replacing the training data with the child corpus.

Before training the parent model, it is necessary to designate some vocabulary entries for the new language. Otherwise the model would be forced to completely re-learn its subword embeddings and their connections and would lose its ability to transfer. Kocmi [11] shows that the best strategy is to generate the vocabulary in advance from the concatenation of corpora of both the child and the parent language pair. However, if the child language is not known prior to training the parent, it is enough to leave some "free" slots in the vocabulary and later fill them in with the vocabulary of the child language.

In this work, we experiment with several transfer learning schedules. We repeat the transfer procedure several times with the child becoming the parent for either a completely new language (e.g. German → English → Spanish → ...) or for the original parent (e.g. German → English → German → ...), as illustrated in Figure 1. We always generate the vocabulary from the concatenation of the parent and its "first child". When adding a third (or fourth) language, the joint BPE vocabulary has to be modified by replacing the original parent vocabulary entries with the new child ones. The schedules and their results are described in Section 4.

2.4 Multilingual Training

The second strategy to utilize multilingual in-domain training corpora is joint multilingual training.

Multilingual translation systems are either trained with full parameter sharing [1, 7, 9], or with language-specific encoders and decoders relying on shared attention [5] or an attention bridge [20]. The results show that multilingual models yield comparable or even superior results to the standard bilingual setup.

In this work, we rely on full parameter sharing and use the same architecture as our bilingual systems, while training the model to translate from English into three languages (French, Italian and Spanish) at once. During inference, the target language is determined by the language embeddings indicated for the target sentence. We selected these three languages for their similarity which could help the model re-use and share some knowledge. The BPE vocabulary was extracted from the concatenation of all four corpora, using only unique English sentences to reach a comparable corpus size.

3 Data

Covid-19 MLIA @ Eval organized a community evaluation effort aimed at accelerating the creation of resources and tools for improved Multilingual Information Access (MLIA). A part of this initiative is a competition to develop the best MT system translating from English to several European languages: German, Modern Greek, French, Italian, Spanish and Swedish. The competition is incremental and so far only the first round has been concluded.

The parallel training data provided by the organizers for the first round and used in this paper is summarized in Table 1.² It was created based on existing corpora from the medical domain, enriched with sentences directly about Covid, mostly harvested through web crawling and parallel sentence mining [3]. The sentences in different languages might be similar, but the entire corpus collection is not multi-parallel.

² The development test set used for the final model selection was obtained by cutting 500 sentences off of either the train set or the development set, depending on the original development set size.

All data was segmented into BPE units [17] with a vocabulary of 30k items for the training.
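To illustrate the subword preprocessing, the snippet below trains a 30k-item BPE vocabulary and segments a sentence with it. We use the sentencepiece library purely as a stand-in and the file names are placeholders; the actual experiments rely on the BPE tooling bundled with the XLM pipeline [17].

```python
import sentencepiece as spm

# Train a 30k BPE model on a (placeholder) concatenated training corpus
# and segment one sentence with it.
spm.SentencePieceTrainer.train(
    input="train.all.txt",         # hypothetical concatenation of the corpora
    model_prefix="covid_bpe30k",
    vocab_size=30000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="covid_bpe30k.model")
print(sp.encode("Vaccination appointments can be booked online.", out_type=str))
```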
[Figure 1: Illustration of the incremental transfer for selected languages: BLEU scores on dev set. English → German schedules: EN→DE (20.76); EN→ES → EN→DE (21.26); EN→DE → EN→ES → EN→DE (22.60); EN→DE → EN→SV → EN→DE (22.55); EN→DE → EN→SV → EN→DE → EN→SV → EN→DE (22.50). English → Italian schedules: EN→IT (30.97); EN→ES → EN→IT (31.68); EN→DE → EN→ES → EN→IT (32.10); EN→DE → EN→ES → EN→IT → EN→ES → EN→IT (33.07). In each schedule, the first language pair is trained from scratch and every following pair is obtained by fine-tuning the previous model.]
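The schedules in Figure 1 boil down to a simple loop: train the parent until convergence, then repeatedly swap the parallel corpus and continue training. The sketch below captures this loop; `load_corpus` and `train_until_convergence` are hypothetical stand-ins for the XLM data pipeline and training loop.

```python
from typing import Callable, Sequence

def run_transfer_schedule(model,
                          schedule: Sequence[str],
                          load_corpus: Callable,
                          train_until_convergence: Callable):
    """Incremental transfer: the model trained in one stage becomes the
    parent of the next stage; only the training data changes between stages."""
    for pair in schedule:
        # The first stage trains the parent from scratch; later stages
        # fine-tune the current weights on the new child corpus.
        train_until_convergence(model, load_corpus(pair))
    return model

# Example: the best English->Italian schedule from Table 3 would be
# run_transfer_schedule(model, ["en-de", "en-es", "en-it", "en-es", "en-it"], ...)
```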

                                                       de          el       es       fr        it         sv
                       bidirectional with BT          21.52      22.30     40.94    38.46    33.17       20.61
                       unidirectional without BT      20.76      22.70     40.46    35.57    30.97       19.13

         Table 2: Translating from English using the baseline model and back-translation: BLEU scores on dev set.
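As a rough illustration of the on-the-fly back-translation described in Section 2.2, the following sketch shows one training step of a bidirectional model. The `generate` and `loss` methods are hypothetical stand-ins for the corresponding XLM toolkit routines, and the real system works on token batches rather than raw sentences.

```python
import torch

def online_bt_step(model, src_batch, tgt_batch, src_lang, tgt_lang, optimizer):
    """One step mixing an authentic pair with a back-translated pair."""
    # 1) Inference mode: translate the authentic target sentences back into
    #    the source language (greedy decoding) to obtain synthetic sources.
    model.eval()
    with torch.no_grad():
        synthetic_src = model.generate(tgt_batch, src_lang=tgt_lang, tgt_lang=src_lang)

    # 2) Training mode: learn from the authentic pair and from the
    #    synthetic pair (synthetic source -> authentic target).
    model.train()
    loss = (model.loss(src_batch, tgt_batch, src_lang, tgt_lang)
            + model.loss(synthetic_src, tgt_batch, src_lang, tgt_lang))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```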


4 Experiments & Results

We participated in the MT shared task of the Covid-19 MLIA @ Eval initiative and trained a model for translation into each of the six languages listed in Section 3. The results of our submitted systems are summarized in the preliminary report [13]; the overall results are discussed in the shared task findings [3]. Our English → German and English → Swedish systems ranked first (tied with one other system), our other models ranked second.

We experimented with three training strategies compared against one baseline (BASE):

1. unidirectional training without back-translation (BASE);
2. bidirectional training with online back-translation (BT);
3. transfer learning (TRANSFER);
4. multilingual training (MULTILING).

For all our MT models we use a 6-layer Transformer [19] architecture with 8 heads, an embedding dimension of 1024 and GELU [8] activations. The training is performed using the XLM³ toolkit. The translation models were trained on 4 GPUs⁴ with 2-step gradient accumulation to reach an effective batch size of 8 × 3400 tokens. The effective batch size has a significant impact on the training and we observe that the models converge to lower BLEU scores for smaller batch sizes. We used the Adam [10] optimizer with inverse square root decay (β1 = 0.9, β2 = 0.98, lr = 0.0001). Beam search with a beam size of 4 was used during final decoding; greedy decoding was used for back-translation. The vocabulary size was set to 30k; using a larger vocabulary leads to a performance drop. All our model parameters are initialized with a pretrained masked language model as described in Conneau and Lample [4].

³ https://github.com/facebookresearch/XLM
⁴ Quadro P5000, 16GB of RAM

4.1 Online Back-Translation

For each language pair we first trained a bidirectional back-translation model described in Section 2 and compared it to a standard unidirectional model without back-translation. Online back-translation improved the score by 0.5–2.9 BLEU points, depending on the language, but surprisingly caused a decrease of 0.4 BLEU in the case of the English–Modern Greek model. The reason for this drop is likely the bidirectionality of the model rather than the data augmentation itself. The results are summarized in Table 2.

We experimented with a dropout of 0.1 and 0.2 and concluded that the higher dropout helps in most settings. This observation is in line with Sennrich and Zhang [18] who emphasize the role of higher dropout when working with low- to medium-sized resources.
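The optimization setup described in Section 4 (Adam with inverse square root decay and 2-step gradient accumulation) could be sketched as follows. The warmup length and the `compute_loss` helper are our own assumptions, not values reported in the paper.

```python
import torch

def inverse_sqrt_factor(step: int, warmup: int = 4000) -> float:
    # Linear warmup followed by decay proportional to 1/sqrt(step);
    # the warmup length is an assumption, not a reported value.
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

def train(model, batches, compute_loss, accumulation_steps: int = 2):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt_factor)
    for i, batch in enumerate(batches):
        # Accumulate gradients over 2 batches so that 4 GPUs x 2 steps
        # yield the effective batch size of 8 x 3400 tokens.
        loss = compute_loss(model, batch) / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```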
             Transfer Combination                              de         el        es        fr        it        sv
             en-es → en-de                                    21.26
             en-de → en-es                                                        41.28
             en-de → en-es → en-de                            22.60
             en-de → en-es → en-fr                                                          35.10
             en-de → en-es → en-it                                                                    32.10
             en-de → en-es → en-it → en-es                                        41.34
             en-de → en-es → en-it → en-es → en-it                                                    33.07
             en-es → en-fr                                                                  32.43
             en-es → en-it                                                                            31.68
             en-de → en-el                                              23.29
             en-es → en-el                                              20.91
             en-de → en-sv                                                                                      21.69
             en-de → en-sv → en-de                            22.55
             en-de → en-sv → en-de → en-sv                                                                      20.56
             en-de → en-sv → en-de → en-sv → en-de            22.50

                 Table 3: Translating from English using the TRANSFER models: BLEU scores on dev set.
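When a third language enters one of the schedules in Table 3, the joint BPE vocabulary is modified by replacing the entries of the retired parent language with the new child entries (Section 2.3). A minimal sketch of the corresponding embedding-matrix surgery, assuming plain token lists and a standard embedding layer, could look as follows; it is our own illustration rather than the exact procedure in the XLM code.

```python
import torch

def rebuild_embedding(old_embedding: torch.nn.Embedding,
                      old_vocab: list[str],
                      new_vocab: list[str]) -> torch.nn.Embedding:
    """Create an embedding layer for the new (child) vocabulary.

    Rows for subwords shared between the two vocabularies are copied from
    the parent model; subwords that only exist in the new vocabulary get
    freshly initialized rows, effectively replacing the entries of the
    language that was dropped.
    """
    old_index = {token: i for i, token in enumerate(old_vocab)}
    new_embedding = torch.nn.Embedding(len(new_vocab), old_embedding.embedding_dim)
    with torch.no_grad():
        for j, token in enumerate(new_vocab):
            if token in old_index:
                new_embedding.weight[j] = old_embedding.weight[old_index[token]]
    return new_embedding
```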


                        de       el       es       fr       it       sv
      multilingual       -        -      40.2     36.1     32.8       -
      best transfer    22.6     23.3     41.3     35.1     33.1     21.7
      best base        21.5     22.7     40.9     38.5     33.2     20.6

      Table 4: Translating from English using the best models from each category: BLEU scores on dev set.

                        de       el       es       fr       it       sv
      multilingual       -        -      47.3     48.0     28.3       -
      best transfer    31.6     24.7     47.9     47.1     28.3     30.1
      best base        31.4     24.1     47.3     48.4      -       28.5

      Table 5: Translating from English using the best models from each category: BLEU scores on blind test set.

4.2 Transfer Learning

We used the best-performing BASE / BT models as the parent models and continued with unidirectional training (English → foreign language) for our transfer learning experiments. Since the fine-tuning is unidirectional, we can no longer perform online back-translation.

We observed that it often helped to use the transfer incrementally, having the model converge on one parallel corpus, switch the target language, wait for convergence and switch again. We hypothesize that the model benefits from seeing a larger variety of sentences. For example, transferring from German to Spanish to Italian (32.10 BLEU) performs better than transferring directly from Spanish to Italian (31.68 BLEU). The best combination is to even repeat the Spanish-Italian transfer twice (33.07 BLEU).

When translating from English to German, fine-tuning the en-de BT model on English→Spanish (or English→Swedish) and switching back to English→German adds around 1 BLEU on top of the original BT model. All language combinations used in our transfer learning experiments are described in Table 3 and selected schedules are illustrated in Figure 1.

We observe that transfer learning improves the performance in all cases but French, where the BASE model with BT reaches 38.5 BLEU, which is ∼3 BLEU points more than transfer learning. There is a significant overlap between the training sets in different languages and it is possible that French does not benefit from the transfer because it does not provide enough new sentences. On the other hand, the largest improvement is seen for the language pair with the least amount of training data, English–Swedish, where BLEU increases by 1.1 points on the dev set and 1.6 points on the test set.

4.3 Multilingual Training

We train a multilingual model for translation from English to French, Italian and Spanish. The model has the same architecture as our bilingual models; all parameters are shared for all languages. Its encoder and decoder were first pretrained on monolingual data in all three languages and English using the MLM criterion [4].

Table 4 shows the comparison of the TRANSFER models with a multilingual model trained jointly. We observe that transfer learning yields superior results and is thus a more effective way to leverage multilingual data than joint multilingual training. However, there is an advantage of a joint model in terms of the training and storage cost. After three days of training, the multilingual model can be used for translation into all three languages. The initial BASE models can take between one (without BT) and five (with BT) days to train, and fine-tuning on a child language pair adds around 6 hours.

Table 5 lists our task submissions and compares all approaches on the official Covid-19 MLIA @ Eval blind test set.⁵

⁵ The BLEU scores in Table 4 and Table 5 cannot be directly compared as the dev scores were calculated by the authors and the test scores by the organizers.
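In the joint multilingual model, the decoder is told which of the three languages to produce through language embeddings added to the target-side input (Sections 2.4 and 4.3). A minimal sketch, with illustrative tensor shapes and a hypothetical language-ID table:

```python
import torch

# Hypothetical language-ID table for the joint EN->{FR,IT,ES} model.
LANG_IDS = {"en": 0, "fr": 1, "it": 2, "es": 3}

def add_language_embedding(token_embeddings: torch.Tensor,
                           lang: str,
                           lang_embedding: torch.nn.Embedding) -> torch.Tensor:
    """Sum a learned language embedding into every position of a
    (batch x length x dim) tensor of token embeddings, so the shared
    decoder knows which target language to generate."""
    batch, length, _ = token_embeddings.shape
    ids = torch.full((batch, length), LANG_IDS[lang], dtype=torch.long)
    return token_embeddings + lang_embedding(ids)
```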
5 Conclusion

We trained several MT systems specialized in the translation of texts related to the topic of Covid-19 and the pandemic from English to six European languages.

We experimented with three training approaches and we conclude that there is not a universal winner that would work best for all language pairs. However, transfer learning brings promising results across the board, especially when training data is limited. We observed an interesting phenomenon where incremental fine-tuning on multiple languages brings additional gains, as we expose the model to a larger variety of training sentences.

In our setting, transferring knowledge is a more efficient way to leverage multilingual data than joint training. For English→German, we observe that a transfer learning detour via Spanish or Swedish improves the parent model itself. For English→Modern Greek, transfer learning via German works well, despite the unrelatedness of the two languages. For English→French, on the other hand, a bidirectional model with back-translation beats both multilingual and transfer-based models.

Acknowledgments

This study was supported in part by the grants CZ.07.1.02/0.0/0.0/16_023/0000108 (Operational Programme – Growth Pole of the Czech Republic), 19-26934X of the Czech Science Foundation, and by the SVV project number 260 575.

References

 [1] Aharoni, R., Johnson, M., Firat, O.: Massively multilingual neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3874–3884, Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)

 [2] Artetxe, M., Labaka, G., Agirre, E., Cho, K.: Unsupervised neural machine translation. In: Proceedings of the Sixth International Conference on Learning Representations (April 2018)

 [3] Casacuberta, F., Ceausu, A., Choukri, K., Deligiannis, M., Domingo, M., Garcia-Martinez, M., Herranz, M., Papavassiliou, V., Piperidis, S., Prokopidis, P., Roussis, D.: The Covid-19 MLIA @ Eval Initiative: Overview of the machine translation task (2021), URL http://eval.covid19-mlia.eu/meetings/round1/report/20210112-task3-overview.pdf

 [4] Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 7059–7069, Curran Associates, Inc. (2019)

 [5] Firat, O., Cho, K., Bengio, Y.: Multi-way, multilingual neural machine translation with a shared attention mechanism. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 866–875, Association for Computational Linguistics, San Diego, California (Jun 2016)

 [6] Freitag, M., Al-Onaizan, Y.: Fast domain adaptation for neural machine translation. CoRR abs/1612.06897 (2016)

 [7] Ha, T.L., Niehues, J., Waibel, A.: Toward multilingual neural machine translation with universal encoder and decoder (2016)

 [8] Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415 (2017)

 [9] Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J.: Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, 339–351 (2017)

[10] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference for Learning Representations (2015)

[11] Kocmi, T.: Exploring Benefits of Transfer Learning in Neural Machine Translation. Ph.D. thesis, Charles University (2020)

[12] Kocmi, T., Bojar, O.: Trivial transfer learning for low-resource neural machine translation. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 244–252, Association for Computational Linguistics, Brussels (Oct 2018)

[13] Kvapilíková, I.: CUNI machine translation systems for the Covid-19 MLIA initiative (2021), URL http://eval.covid19-mlia.eu/meetings/round1/report/20210114-cunimt.pdf

[14] Lample, G., Denoyer, L., Ranzato, M.: Unsupervised machine translation using monolingual corpora only. In: 6th International Conference on Learning Representations (ICLR 2018) (2018)
[15] Niu, X., Denkowski, M., Carpuat, M.: Bi-directional
     neural machine translation with synthetic parallel
     data. In: Proceedings of the 2nd Workshop on Neural
     Machine Translation and Generation, pp. 84–91, As-
     sociation for Computational Linguistics, Melbourne,
     Australia (Jul 2018)

[16] Sennrich, R., Haddow, B., Birch, A.: Improving
     neural machine translation models with monolingual
     data. In: Proceedings of the 54th Annual Meeting of
     the ACL (Volume 1: Long Papers), pp. 86–96, Asso-
     ciation for Computational Linguistics, Berlin, Ger-
     many (Aug 2016)

[17] Sennrich, R., Haddow, B., Birch, A.: Neural machine
     translation of rare words with subword units. In: Pro-
     ceedings of the 54th Annual Meeting of the ACL, pp.
     1715–1725, Association for Computational Linguis-
     tics, Berlin (Aug 2016)

[18] Sennrich, R., Zhang, B.: Revisiting low-resource
     neural machine translation: A case study. In: Pro-
     ceedings of the 57th Annual Meeting of the Associa-
     tion for Computational Linguistics, pp. 211–221, As-
     sociation for Computational Linguistics, Florence,
     Italy (Jul 2019)
[19] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
     Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.:
     Attention is all you need. In: Guyon, I., Luxburg,
     U.V., Bengio, S., Wallach, H., Fergus, R., Vish-
     wanathan, S., Garnett, R. (eds.) Advances in Neural
     Information Processing Systems 30, pp. 6000–6010,
     Curran Associates, Inc. (2017)
[20] Vázquez, R., Raganato, A., Tiedemann, J., Creutz,
     M.: Multilingual NMT with a language-independent
     attention bridge. In: Proceedings of the 4th
     Workshop on Representation Learning for NLP
     (RepL4NLP-2019), pp. 33–39, Association for Com-
     putational Linguistics, Florence, Italy (Aug 2019)
[21] Wu, L., Wang, Y., Xia, Y., Qin, T., Lai, J., Liu, T.Y.:
     Exploiting monolingual data at scale for neural ma-
     chine translation. In: Proceedings of the 2019 Con-
     ference on Empirical Methods in Natural Language
     Processing and the 9th International Joint Confer-
     ence on Natural Language Processing (EMNLP-
     IJCNLP), pp. 4207–4216, Association for Compu-
     tational Linguistics, Hong Kong, China (Nov 2019)
[22] Zoph, B., Yuret, D., May, J., Knight, K.: Transfer
     learning for low-resource neural machine translation.
     In: Proceedings of the 2016 Conference on EMNLP,
     pp. 1568–1575, Association for Computational Lin-
     guistics, Austin, Texas (Nov 2016)