=Paper=
{{Paper
|id=Vol-2962/paper26
|storemode=property
|title=Machine Translation of Covid-19 Information Resources via Multilingual Transfer
|pdfUrl=https://ceur-ws.org/Vol-2962/paper26.pdf
|volume=Vol-2962
|authors=Ivana Kvapilíková,Ondřej Bojar
|dblpUrl=https://dblp.org/rec/conf/itat/KvapilikovaB21
}}
==Machine Translation of Covid-19 Information Resources via Multilingual Transfer==
Machine Translation of Covid-19 Information Resources via Multilingual Transfer

Ivana Kvapilíková and Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Malostranské náměstí 25, 118 00 Prague, Czech Republic
@ufal.mff.cuni.cz

Abstract: The Covid-19 pandemic has created a global demand for accurate and up-to-date information which often originates in English and needs to be translated. To train a machine translation system for such a narrow topic, we leverage in-domain training data in other languages, both from related and unrelated language families. We experiment with different transfer learning schedules and observe that transferring via more than one auxiliary language brings the most improvement. We compare the performance with joint multilingual training and report superior results of the transfer learning approach.

1 Introduction

A global crisis such as the current Covid-19 pandemic requires information to be spread as efficiently as possible. Working with data from different international resources in multiple languages can resolve possible inconsistencies and prevent misinformation. In an emergency situation, new data is released constantly and is communicated to the public not only via national news and authorities, but also via foreign media, scientific journals or statements of international agencies. There are extensive data resources written in English which are not accessible to non-English speakers.

In order to quickly access the information in a foreign language, machine translation (MT) can be of great help. However, Covid-related texts use specific terminology and MT models are known to struggle outside of the general domain.

More than a year after the Covid outbreak, there already is a significant amount of domain-specific multilingual text resources. Furthermore, Covid-related texts are part of a broader medical domain which can provide additional authentic data for training. Thanks to the Covid-19 MLIA @ Eval initiative (http://eval.covid19-mlia.eu/), which gathered training data for MT and information retrieval related to the pandemic, we can successfully adapt an MT system to the Covid domain or even train it from scratch.

This paper gives an overview of possible methods to automatically translate Covid-related texts. Section 2 outlines different approaches to train a domain-specific MT system using multilingual corpora. Section 3 describes our training data and Section 4 gives more details about our MT systems and presents the results. Section 5 concludes the paper.

2 Methodology

In this section we outline several strategies applicable in the situation where we need to translate from English to multiple languages, are confined to a specific domain, and have mid-size parallel corpora for every language pair of interest.

Firstly, we can train a standard MT system from scratch for each language separately, possibly resorting to some data augmentation method, e.g. back-translation. Secondly, we can use transfer learning to transfer from a pretrained MT system in one language to another. Finally, we can train a multilingual MT system which learns jointly from all available data.

In our experiments, we do not consider any additional monolingual resources. Although monolingual data are generally easier to obtain, we remain constrained by the datasets provided by the organizers of MLIA @ Eval, which are described in Section 3. We also do not evaluate a transfer from a large MT model pretrained on texts from the general domain, which would be a promising strategy as well.

2.1 Low-resource Neural Machine Translation

When neural machine translation (NMT) became the dominant paradigm in MT [19], it was believed that extremely large parallel resources are required for training. However, Sennrich and Zhang [18] showed that with careful tuning of the hyperparameters, an NMT model can be successfully trained already on 100k sentence pairs, which is less than we have available in the Covid/medical domain for the language pairs of our interest. Furthermore, Conneau and Lample [4] show that translation quality can be further boosted by pretraining a language model and using it to initialize the parameters of both the encoder and the decoder.

An NMT system directly trained to translate in the Covid domain serves as our baseline.
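The language-model initialization of [4] amounts to copying every pretrained parameter whose name and shape match into the encoder and the decoder before supervised training starts. The following PyTorch sketch is only an illustration of that idea, not the actual XLM implementation; the model attributes and checkpoint path are assumptions.

```python
# Illustrative sketch of LM initialization in the spirit of Conneau & Lample [4]:
# copy every pretrained masked-LM parameter whose name and shape match into
# both the encoder and the decoder before supervised MT training starts.
# `seq2seq_model.encoder/.decoder` and the checkpoint path are assumptions.
import torch

def init_from_pretrained_mlm(seq2seq_model, mlm_checkpoint_path="mlm.pt"):
    mlm_state = torch.load(mlm_checkpoint_path, map_location="cpu")
    for module in (seq2seq_model.encoder, seq2seq_model.decoder):
        own = module.state_dict()
        matching = {name: tensor for name, tensor in mlm_state.items()
                    if name in own and own[name].shape == tensor.shape}
        # parameters without a pretrained counterpart stay randomly initialized
        module.load_state_dict(matching, strict=False)
    return seq2seq_model
```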
2.2 Data Augmentation

Back-translation is a crucial method in NMT used to augment training data by translating an existing monolingual corpus [16]. The synthetic text can be either on the source [16] or the target [21] side of the training corpus, or both [15].

When using a bidirectional model (sharing the encoder and decoder for both translation directions), back-translation can be performed on the fly. During training, the model switches between the training and the inference mode to produce batches of synthetic sentence pairs and learns from both authentic and synthetic samples in each training step. As the system improves, the quality of the generated samples improves as well. This approach was originally proposed for training unsupervised MT systems [2, 14].

In our systems, we translate only the target sentences and generate a synthetic source side on the fly. We do not use any additional monolingual data for back-translation.
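To make the online procedure concrete, the sketch below shows one possible shape of such a training step, assuming a bidirectional model object with hypothetical `train_step` and `translate` helpers; it is a simplification, not the code of the toolkit used in our experiments.

```python
# Sketch of one online back-translation step with a bidirectional model.
# Assumed (hypothetical) helpers: model.train_step(src, tgt) returns a loss
# tensor; model.translate(batch, direction=...) decodes greedily.
import torch

def online_bt_step(model, src_batch, tgt_batch, optimizer):
    # supervised loss on the authentic pair, in both directions
    loss = model.train_step(src_batch, tgt_batch) + model.train_step(tgt_batch, src_batch)

    # switch to inference mode and back-translate the authentic target side
    model.eval()
    with torch.no_grad():
        synthetic_src = model.translate(tgt_batch, direction="tgt->src")
    model.train()

    # learn from the synthetic pair as well
    loss = loss + model.train_step(synthetic_src, tgt_batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

As the model improves during training, the synthetic source sentences it generates improve too, which is what makes the on-the-fly variant attractive compared to a fixed, pre-generated synthetic corpus.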
2.3 Transfer Learning

The first strategy to utilize multilingual in-domain training corpora is transfer learning. It can be used to transfer from a different domain [6] or a different language [12, 22]. In this work we focus on the latter.

A trivial transfer learning approach was proposed by Kocmi and Bojar [12] who fine-tune a low-resource child model from a high-resource parent model pretrained for a different language pair. The training procedure consists of first training an NMT model on the parent parallel corpus until it converges and then replacing the training data with the child corpus.

Before training the parent model, it is necessary to designate some vocabulary entries for the new language. Otherwise the model would be forced to completely re-learn its subword embeddings and their connections and would lose its ability to transfer. Kocmi [11] shows that the best strategy is to generate the vocabulary in advance from the concatenation of corpora of both the child and the parent language pair. However, if the child language is not known prior to training the parent, it is enough to leave some "free" slots in the vocabulary and later fill them in with the vocabulary of the child language.

In this work, we experiment with several transfer learning schedules. We repeat the transfer procedure several times, with the child becoming the parent either for a completely new language (e.g. English→German, then English→Spanish, then English→Italian, ...) or for the original parent (e.g. English→German, then English→Spanish, then English→German again, ...), as illustrated in Figure 1. We always generate the vocabulary from the concatenation of the parent and its "first child". When adding a third (or fourth) language, the joint BPE vocabulary has to be modified by replacing the original parent vocabulary entries with the new child ones. The schedules and their results are described in Section 4.
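As an illustration of the shared vocabulary described above, the snippet below trains one joint BPE model on the concatenated parent and first-child corpora; sentencepiece is used here only as a stand-in for the BPE tooling of [17], and the file names are assumptions.

```python
# Illustrative only: build one joint 30k BPE vocabulary over the parent
# (en-de) and the first child (es) corpora before training the parent model.
# sentencepiece stands in for the subword tooling of Sennrich et al. [17];
# the file names are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en,train.de,train.es",  # parent and first-child corpora concatenated
    model_prefix="joint_bpe",
    vocab_size=30000,                    # 30k items, as used in our experiments
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="joint_bpe.model")
print(sp.encode("Vaccination centres remain open during the lockdown.", out_type=str))
```

Building the vocabulary once over both language pairs is what keeps the subword embeddings reusable when the training data is later switched from the parent to the child corpus.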
2.4 Multilingual Training

The second strategy to utilize multilingual in-domain training corpora is joint multilingual training.

Multilingual translation systems are either trained with full parameter sharing [1, 7, 9], or with language-specific encoders and decoders relying on shared attention [5] or an attention bridge [20]. The reported results show that multilingual models yield comparable or even superior results to the standard bilingual setup.

In this work, we rely on full parameter sharing and use the same architecture as for our bilingual systems, while training it to translate from English into three languages (French, Italian and Spanish) at once. During inference, the target language is determined from the indicated language embeddings of the target sentence. We selected these three languages for their similarity, which could help the model re-use and share some knowledge. The BPE vocabulary was extracted from the concatenation of all four corpora, using only unique English sentences to reach a comparable corpus size.

3 Data

Covid-19 MLIA @ Eval organized a community evaluation effort aimed at accelerating the creation of resources and tools for improved Multilingual Information Access (MLIA). A part of this initiative is a competition to develop the best MT system translating from English to several European languages: German, Modern Greek, French, Italian, Spanish and Swedish. The competition is incremental and so far only the first round has been concluded.

The parallel training data provided by the organizers for the first round and used in this paper is summarized in Table 1. It was created based on existing corpora from the medical domain, enriched with sentences directly about Covid, mostly harvested through web crawling and parallel sentence mining [3]. The sentences in different languages might be similar, but the entire corpus collection is not multi-parallel.

All data was segmented into BPE units [17] with a vocabulary of 30k items for the training.

             de         el         es          fr          it         sv
Train        925,647    834,240    1,028,287   1,004,215   900,472    806,425
Dev          528        3,378      1,973       728         3,245      723
Dev Test     500        500        500         500         500        500
Blind Test   2,000      2,000      2,000       2,000       2,000      2,000

Table 1: Data summary. (The development test set used for the final model selection was obtained by cutting 500 sentences off either the train set or the development set, depending on the original development set size.)

4 Experiments & Results

We participated in the MT shared task of the Covid-19 MLIA @ Eval initiative and trained a model for translation into each of the six languages listed in Section 3. The results of our submitted systems are summarized in the preliminary report [13]; the overall results are discussed in the shared task findings [3]. Our English→German and English→Swedish systems ranked first (tied with one other system), and our other models ranked second.

We experimented with three training strategies, all compared against one baseline:

1. unidirectional training without back-translation (BASE);
2. bidirectional training with online back-translation (BT);
3. transfer learning (TRANSFER);
4. multilingual training (MULTILING).

For all our MT models we use a 6-layer Transformer [19] architecture with 8 attention heads, an embedding dimension of 1024 and GELU [8] activations. The training is performed using the XLM toolkit (https://github.com/facebookresearch/XLM). The translation models were trained on 4 GPUs (Quadro P5000, 16 GB of RAM) with 2-step gradient accumulation to reach an effective batch size of 8 × 3400 tokens. The effective batch size has a significant impact on the training and we observe that the models converge to lower BLEU scores for smaller batch sizes. We used the Adam [10] optimizer with inverse square root decay (β1 = 0.9, β2 = 0.98, lr = 0.0001). Beam search with a beam size of 4 was used during final decoding; greedy decoding was used for back-translation. The vocabulary size was set to 30k; using a larger vocabulary leads to a performance drop. All our model parameters are initialized with a pretrained masked language model as described in Conneau and Lample [4].
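For readers who want to reproduce the optimizer setting, the sketch below shows one common form of Adam with inverse square root decay in PyTorch; the warmup length is an assumption (the paper does not state it), and the linear layer merely stands in for the Transformer.

```python
# Sketch of Adam with inverse-square-root decay (beta1=0.9, beta2=0.98, lr=1e-4).
# The 4000-step warmup is an assumption; the multiplier equals 1 at the end of
# warmup and then decays proportionally to 1/sqrt(step).
import torch

def inverse_sqrt_factor(step, warmup=4000):
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5) * warmup ** 0.5

model = torch.nn.Linear(1024, 1024)  # placeholder for the actual Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_factor)

for step in range(100):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()
```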
4.1 Online Back-Translation

For each language pair we first trained a bidirectional back-translation model as described in Section 2.2 and compared it to a standard unidirectional model without back-translation. Online back-translation improved the score by 0.5–2.9 BLEU points, depending on the language, but surprisingly caused a decrease of 0.4 BLEU in the case of the English–Modern Greek model. The reason for this drop is likely the bidirectionality of the model rather than the data augmentation itself. The results are summarized in Table 2.

                              de      el      es      fr      it      sv
bidirectional with BT         21.52   22.30   40.94   38.46   33.17   20.61
unidirectional without BT     20.76   22.70   40.46   35.57   30.97   19.13

Table 2: Translating from English using the baseline model and back-translation: BLEU scores on dev set.

We experimented with a dropout of 0.1 and 0.2 and concluded that the higher dropout helps in most settings. This observation is in line with Sennrich and Zhang [18] who emphasize the role of higher dropout when working with low- to medium-sized resources.

4.2 Transfer Learning

We used the best-performing BASE/BT models as the parent models and continued with unidirectional training (English → foreign language) for our transfer learning experiments. Since the fine-tuning is unidirectional, we can no longer perform online back-translation.

We observed that it often helped to apply the transfer incrementally, having the model converge on one parallel corpus, switch the target language, wait for convergence and switch again. We hypothesize that the model benefits from seeing a larger variety of sentences. For example, transferring from German to Spanish to Italian (32.10 BLEU) performs better than transferring directly from Spanish to Italian (31.68 BLEU). The best combination even repeats the Spanish–Italian transfer twice (33.07 BLEU).

When translating from English to German, fine-tuning the en-de BT model on English→Spanish (or English→Swedish) and switching back to English→German adds around 1 BLEU on top of the original BT model. All language combinations used in our transfer learning experiments are described in Table 3 and selected schedules are illustrated in Figure 1.

[Figure 1: Illustration of the incremental transfer for selected languages: BLEU scores on dev set. Two panels, English→German and English→Italian, show the training/fine-tuning schedules; the values correspond to the respective rows of Table 3.]

Transfer combination                        final target   dev BLEU
en-es → en-de                               de             21.26
en-de → en-es                               es             41.28
en-de → en-es → en-de                       de             22.60
en-de → en-es → en-fr                       fr             35.10
en-de → en-es → en-it                       it             32.10
en-de → en-es → en-it → en-es               es             41.34
en-de → en-es → en-it → en-es → en-it       it             33.07
en-es → en-fr                               fr             32.43
en-es → en-it                               it             31.68
en-de → en-el                               el             23.29
en-es → en-el                               el             20.91
en-de → en-sv                               sv             21.69
en-de → en-sv → en-de                       de             22.55
en-de → en-sv → en-de → en-sv               sv             20.56
en-de → en-sv → en-de → en-sv → en-de       de             22.50

Table 3: Translating from English using the TRANSFER models: BLEU scores on dev set (each score is measured for the final target language of the schedule).

We observe that transfer learning improves the performance in all cases but French, where the BASE model with BT reaches 38.5 BLEU, which is ∼3 BLEU points more than transfer learning. There is a significant overlap between the training sets in different languages and it is possible that French does not benefit from the transfer because it does not provide enough new sentences. On the other hand, the largest improvement is seen for the language pair with the least amount of training data, English–Swedish, where BLEU increases by 1.1 points on the dev set and 1.6 points on the test set.

4.3 Multilingual Training

We train a multilingual model for translation from English to French, Italian and Spanish. The model has the same architecture as our bilingual models; all parameters are shared across the languages. Its encoder and decoder were first pretrained on monolingual data in all three languages and English using the MLM criterion [4].

Table 4 shows the comparison of the TRANSFER models with a multilingual model trained jointly. We observe that transfer learning yields superior results and is thus a more effective way to leverage multilingual data than joint multilingual training. However, the joint model has an advantage in terms of training and storage cost. After three days of training, the multilingual model can be used for translation into all three languages. The initial BASE models can take between one (without BT) and five (with BT) days to train, and fine-tuning on a child language pair adds around 6 hours.

Table 5 lists our task submissions and compares all approaches on the official Covid-19 MLIA @ Eval blind test set. (Note that the BLEU scores in Table 4 and Table 5 cannot be directly compared, as the dev scores were calculated by the authors and the test scores by the organizers.)

                 de     el     es     fr     it     sv
multilingual     -      -      40.2   36.1   32.8   -
best transfer    22.6   23.3   41.3   35.1   33.1   21.7
best base        21.5   22.7   40.9   38.5   33.2   20.6

Table 4: Translating from English using the best models from each category: BLEU scores on dev set.

                 de     el     es     fr     it     sv
multilingual     -      -      47.3   48.0   28.3   -
best transfer    31.6   24.7   47.9   47.1   28.3   30.1
best base        31.4   24.1   47.3   48.4   -      28.5

Table 5: Translating from English using the best models from each category: BLEU scores on blind test set.
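To illustrate how a fully shared model can be steered toward a particular output language, the toy PyTorch module below adds a target-language embedding to every token embedding. It mirrors the language-embedding mechanism mentioned in Sections 2.4 and 4.3 only in spirit; the language-id mapping is an assumption for this sketch.

```python
# Toy illustration of target-language selection in a fully shared model:
# each token embedding is summed with an embedding of the target language id.
# The id mapping (0=en, 1=fr, 2=it, 3=es) is an assumption for this sketch.
import torch
import torch.nn as nn

class LanguageAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, n_langs=4, dim=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)

    def forward(self, token_ids, lang_id):
        lang_ids = torch.full_like(token_ids, lang_id)
        return self.token_emb(token_ids) + self.lang_emb(lang_ids)

emb = LanguageAwareEmbedding()
batch = torch.randint(0, 30000, (2, 7))   # a toy batch of BPE token ids
french_input = emb(batch, lang_id=1)      # same parameters, French output requested
spanish_input = emb(batch, lang_id=3)     # ... or Spanish output
```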
5 Conclusion

We trained several MT systems specialized in the translation of texts related to the topic of Covid-19 and the pandemic from English to six European languages.

We experimented with three training approaches and we conclude that there is no universal winner that works best for all language pairs. However, transfer learning brings promising results across the board, especially when training data is limited. We observed an interesting phenomenon where incremental fine-tuning on multiple languages brings additional gains, as we expose the model to a larger variety of training sentences.

In our setting, transferring knowledge is a more efficient way to leverage multilingual data than joint training. For English→German, we observe that a transfer learning detour via Spanish or Swedish improves the parent model itself. For English→Modern Greek, transfer learning via German works well, despite the unrelatedness of the two languages. For English→French, on the other hand, a bidirectional model with back-translation beats both multilingual and transfer-based models.

Acknowledgments

This study was supported in parts by the grants CZ.07.1.02/0.0/0.0/16_023/0000108 (Operational Programme – Growth Pole of the Czech Republic), 19-26934X of the Czech Science Foundation, and by the SVV project number 260 575.

References

[1] Aharoni, R., Johnson, M., Firat, O.: Massively multilingual neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3874–3884, Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)

[2] Artetxe, M., Labaka, G., Agirre, E., Cho, K.: Unsupervised neural machine translation. In: Proceedings of the Sixth International Conference on Learning Representations (April 2018)

[3] Casacuberta, F., Ceausu, A., Choukri, K., Deligiannis, M., Domingo, M., Garcia-Martinez, M., Herranz, M., Papavassiliou, V., Piperidis, S., Prokopidis, P., Roussis, D.: The Covid-19 MLIA @ Eval Initiative: Overview of the machine translation task (2021), URL http://eval.covid19-mlia.eu/meetings/round1/report/20210112-task3-overview.pdf

[4] Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 7059–7069, Curran Associates, Inc. (2019)

[5] Firat, O., Cho, K., Bengio, Y.: Multi-way, multilingual neural machine translation with a shared attention mechanism. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 866–875, Association for Computational Linguistics, San Diego, California (Jun 2016)

[6] Freitag, M., Al-Onaizan, Y.: Fast domain adaptation for neural machine translation. CoRR abs/1612.06897 (2016)

[7] Ha, T.L., Niehues, J., Waibel, A.: Toward multilingual neural machine translation with universal encoder and decoder (2016)

[8] Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415 (2017)

[9] Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J.: Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, 339–351 (2017)

[10] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference for Learning Representations (2015)

[11] Kocmi, T.: Exploring Benefits of Transfer Learning in Neural Machine Translation. Ph.D. thesis, Charles University (2020)
[12] Kocmi, T., Bojar, O.: Trivial transfer learning for low-resource neural machine translation. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 244–252, Association for Computational Linguistics, Brussels (Oct 2018)

[13] Kvapilíková, I.: CUNI machine translation systems for the Covid-19 MLIA initiative (2021), URL http://eval.covid19-mlia.eu/meetings/round1/report/20210114-cunimt.pdf

[14] Lample, G., Denoyer, L., Ranzato, M.: Unsupervised machine translation using monolingual corpora only. In: 6th International Conference on Learning Representations (ICLR 2018) (2018)

[15] Niu, X., Denkowski, M., Carpuat, M.: Bi-directional neural machine translation with synthetic parallel data. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 84–91, Association for Computational Linguistics, Melbourne, Australia (Jul 2018)

[16] Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: Proceedings of the 54th Annual Meeting of the ACL (Volume 1: Long Papers), pp. 86–96, Association for Computational Linguistics, Berlin, Germany (Aug 2016)

[17] Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the ACL, pp. 1715–1725, Association for Computational Linguistics, Berlin (Aug 2016)

[18] Sennrich, R., Zhang, B.: Revisiting low-resource neural machine translation: A case study. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 211–221, Association for Computational Linguistics, Florence, Italy (Jul 2019)

[19] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 6000–6010, Curran Associates, Inc. (2017)

[20] Vázquez, R., Raganato, A., Tiedemann, J., Creutz, M.: Multilingual NMT with a language-independent attention bridge. In: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 33–39, Association for Computational Linguistics, Florence, Italy (Aug 2019)

[21] Wu, L., Wang, Y., Xia, Y., Qin, T., Lai, J., Liu, T.Y.: Exploiting monolingual data at scale for neural machine translation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4207–4216, Association for Computational Linguistics, Hong Kong, China (Nov 2019)

[22] Zoph, B., Yuret, D., May, J., Knight, K.: Transfer learning for low-resource neural machine translation. In: Proceedings of the 2016 Conference on EMNLP, pp. 1568–1575, Association for Computational Linguistics, Austin, Texas (Nov 2016)