On the Development of Customized Neural Machine Translation Models

Mauro Cettolo, Roldano Cattoni, Marco Turchi
Fondazione Bruno Kessler, Trento, Italy
{cettolo,cattoni,turchi}@fbk.eu

Abstract

Recent advances in neural modeling have boosted the performance of many machine learning applications. Training neural networks requires large amounts of clean data, which are rarely available; many methods have been designed and investigated by researchers to tackle this issue. As a project partner, we were asked to build translation engines for the weather forecast domain, relying on few, noisy data. Step by step, we developed neural translation models which outperform Google Translate by a large margin. This paper details our approach, which - we think - is paradigmatic for a broader category of machine learning applications and, as such, could be of widespread utility.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The field of machine translation (MT) has experienced significant advances in recent years thanks to improvements in neural modeling. On the one hand, this represents a great opportunity for industrial MT; on the other, it poses the great challenge of collecting the large amounts of clean data needed to train neural networks. MT training data are parallel corpora, that is, collections of sentence pairs where a sentence in the source language is paired with the corresponding translation in the target language. Parallel corpora are typically gathered from any available source, in most cases the web, with few guarantees about quality or domain homogeneity.

Over the years, the scientific community has accumulated a lot of knowledge on ways to address the quantitative and qualitative inadequacy of the parallel data needed to develop translation models. Among others, deeply investigated methods are: corpus filtering (Koehn et al., 2020), data augmentation such as data selection (Moore and Lewis, 2010; Axelrod et al., 2011) and back-translation (Bertoldi and Federico, 2009; Sennrich et al., 2016), and model adaptation (Luong and Manning, 2015; Chu and Wang, 2018). These should be the working tools of anyone who has to develop neural MT models for specific language pairs and domains.

This paper reports on the development of neural MT models for translating forecast bulletins from German into English and Italian, and from Italian into English and German. We were provided with in-domain parallel corpora for each language pair, but not in sufficient quantity to train a neural model from scratch. Moreover, a preliminary analysis of the data showed that the English side is noisy (e.g. missing or partial translations, misaligned sentences), affecting the quality of any pair involving that language. For this very reason, we focus on one of the pairs involving English we had to cover, namely Italian-English.

An overview of the in-domain data and a description of their analysis are given in Section 2, highlighting the issues that emerged. Section 3 describes the methods listed above, together with their employment in our specific use case. The neural translation models we developed are itemized in Section 4, where their performance is compared and discussed; our best models outperform Google Translate by a large margin, and some examples give a grasp of the actual translation quality.

We think that our approach to the specific problem we had to face is paradigmatic for a broader category of machine learning applications, and we hope that it will be useful to the whole NLP scientific community.
2 Data

We were provided with two CSV files of weather forecast bulletins, issued by two different forecast services that from here on are identified by the acronyms BB and TT. Each row of the BB CSV contains, among other things, the text of the original bulletin written in German and, possibly, its translation into Italian and/or English; in the TT CSV, each row pairs the Italian bulletin with its translation into German and/or English.

2.1 Statistics

BB - Bulletins were extracted from the BB CSV file and paired for any possible combination of languages. Each bulletin is stored on a single line but split into a few dozen fields; the average length of each field (about 18 German words) is appropriate for MT systems, which have difficulty processing long sentences. Table 1 shows statistics of the training and test sets for the it-en language pair.

site  task   set      #seg    #src w     #trg w
BB    it-en  trn-nsy  30,957  626,211    505,688
             tst-nsy  20,000  376,553    298,560
             tot      50,957  1,002,764  804,248

Table 1: Statistics of the BB it-en benchmark. The label nsy will become clear after reading Section 3.2.

TT - Bulletins were extracted from the TT CSV file and paired for each language combination. Differently from the BB case, each TT bulletin was stored on a single line without any field split; since bulletins are quite long for automatic processing (on average 30 Italian words) and are the concatenation of rather heterogeneous sentences, we decided to segment them by splitting on strong punctuation. This requires a re-alignment of source/target segments, because in general they differ in number. The re-alignment was performed by means of the hunalign sentence aligner (github.com/danielvarga/hunalign) (Varga et al., 2005). Table 2 shows statistics of the training and test sets for the it-en language pair.

site  task   set  #seg   #src w   #trg w
TT    it-en  trn  5,177  78,834   73,763
             tst  1,962  30,232   28,135
             tot  7,139  109,066  101,898

Table 2: Statistics of the TT it-en benchmark.

2.2 Analysis and Issues

As good practice, before starting the creation of MT models we inspected and analyzed the data, looking for potential problems. Several critical issues emerged, which are described in the following paragraphs.

Non-homogeneity of data - Since the data originated from two distinct weather forecast services (BB and TT), it must first be established whether they are linguistically similar and, if so, to what extent. For this purpose, focusing on the languages of the it-en benchmarks, we measured the perplexity of the BB and TT test sets on n-gram language models (LMs) estimated on the BB and TT training sets (3-gram LMs with modified shift-beta smoothing, estimated using the IRSTLM toolkit (Federico et al., 2008)): the closer the perplexity values of a given text on the two LMs, the greater the linguistic similarity of the BB and TT training sets. Table 3 reports values of perplexity (PP) and out-of-vocabulary rates (%OOV) for all test set vs. LM combinations. In order to isolate the genuine PP of the text, the dictionary upper bound for computing the OOV word penalty was set to 0; the OOV rates are shown for this very reason.

                LM trained on BB trn    LM trained on TT trn
                PP      %OOV            PP      %OOV
it  BB tst      10.8    0.22            92.0    12.07
    TT tst      42.4    0.60            10.3     0.41
en  BB tst       8.9    0.14            80.1     8.49
    TT tst      65.6    2.05            12.7     0.51

Table 3: Cross comparison of BB and TT texts.

Overall, we can notice that the PP of the two test sets varies significantly when computed on in- and out-of-domain data. The PP of any given test set is 4 (42.4 vs. 10.8) to 9 (92.0 vs. 10.3) times higher when measured on the LM estimated on the text of the other provider than on the text of the same provider. These results highlight the remarkable linguistic difference between the bulletins issued by the two forecast services.
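To make the check concrete, here is a minimal sketch of the cross-perplexity computation, assuming one whitespace-tokenized sentence per line; NLTK's Laplace-smoothed trigram LM stands in for IRSTLM's modified shift-beta smoothing, so absolute values will differ from Table 3, and the file names are hypothetical.

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

def read_sents(path):
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

def train_lm(train_path, order=3):
    sents = read_sents(train_path)
    train, vocab = padded_everygram_pipeline(order, sents)
    lm = Laplace(order)           # simple add-one smoothing as a stand-in
    lm.fit(train, vocab)
    return lm

def perplexity(lm, test_path, order=3):
    grams = [g for sent in read_sents(test_path)
             for g in ngrams(pad_both_ends(sent, n=order), order)]
    return lm.perplexity(grams)

# Cross comparison in the spirit of Table 3 (Italian side shown):
for trn in ("bb.trn.it", "tt.trn.it"):
    lm = train_lm(trn)
    for tst in ("bb.tst.it", "tt.tst.it"):
        print(f"LM={trn}  test={tst}  PP={perplexity(lm, tst):.1f}")
```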
In-domain data scarcity - Current state-of-the-art neural MT networks (Section 4.1) have tens to hundreds of millions of parameters that have to be estimated from data. Unfortunately, the amount of provided data does not allow an effective estimation from scratch of such a huge number of parameters, as we empirically show in Section 4.3.

BB English side - The BB data have a major problem on the English side. In fact, looking at the CSV file, we realized that many German bulletins were not translated into English at all. Moreover, the English side contains 20% fewer words than the corresponding German or Italian sides, a difference that is not justified by the morpho-syntactic variations between the languages. Indeed, it happens that entire portions of the original German bulletins are not translated into English, or that a decidedly more compact form is used, as in:

de: Der Hochdruckeinfluss hält bis auf weiteres an.
en: High pressure conditions.

This critical issue affects both training and test sets, as highlighted by the figures in Table 1; as such, it negatively impacts both the quality of the translation models, if they are trained/adapted on such noisy data, and the reliability of evaluations, if they are run on such distorted data. A careful corpus filtering is therefore needed, as discussed in Section 3.2.

3 Methods

3.1 MT Model Adaptation

A standard method for facing the in-domain data scarcity issue mentioned in Section 2.2 is so-called fine-tuning: given a neural MT model trained on a large amount of data in one domain, its parameters are tuned by continuing the training on a small amount of data from another domain (Luong and Manning, 2015; Chu and Wang, 2018). Though effective on the new in-domain data supplied for model adaptation, fine-tuning typically suffers from performance drops on unseen data (the test set), unless proper regularization techniques are adopted (Miceli Barone et al., 2017). We avoid overfitting by fine-tuning our MT models with dropout (set to 0.3) (Srivastava et al., 2014) and performing only a limited number of epochs (5) (Miceli Barone et al., 2017).
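As an illustration of this recipe (dropout 0.3, 5 epochs), the sketch below fine-tunes a public Marian it-en model with the Hugging Face Trainer. This is only a stand-in setup: the paper actually adapts a ModernMT Transformer Big model, and the learning rate and batch size below are our own assumptions, not values from the paper.

```python
from transformers import (MarianMTModel, MarianTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

def finetune(train_ds, model_name="Helsinki-NLP/opus-mt-it-en"):
    """train_ds: tokenized in-domain sentence pairs (a datasets.Dataset)."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    # raise dropout to 0.3 for regularization (Srivastava et al., 2014)
    model = MarianMTModel.from_pretrained(model_name, dropout=0.3)
    args = Seq2SeqTrainingArguments(
        output_dir="ft-weather-it-en",
        num_train_epochs=5,              # few epochs to limit overfitting
        learning_rate=1e-5,              # assumed value
        per_device_train_batch_size=16,  # assumed value
    )
    trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
                             train_dataset=train_ds)
    trainer.train()
    return model
```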
Data selection assumes data supplied for model adaptation, fine-tuning the availability of a large general domain corpus typically suffers from performance drops on un- and a small in-domain corpus; in MT, the aim is to seen data (test set), unless proper regularization extract parallel sentences from the large bilingual techniques are adopted (Miceli Barone et al., corpus that are most relevant to the target domain 2017). We avoid overfitting by fine-tuning our MT as defined by the small corpus. models with dropout (set to 0.3) (Srivastava et al., On the basis of the bilingual cross-entropy dif- 2014) and performing only a limited number of ference (Axelrod et al., 2011), we sorted the sen- epochs (5) (Miceli Barone et al., 2017). tence pairs of the OPUS collection,5 used as gen- 3.2 Corpus Filtering eral domain large dataset, according to their rel- Machine learning typically requires large sets of evance to the domain determined by the concate- clean data. Since rarely large data sets are also nation of the BB and TT training sets. To estab- clean, researchers devoted much effort to data lish the optimal size of the selection, we trained cleaning, the automatic process to identify and re- LMs - created in the same setup described in non- move errors from data. The MT community is no homogeneity of data paragraph of Section 2.2 - on exception. Even, WMT - the conference on ma- increasing amounts of selected data and computed chine translation - in 2018, 2019 and 2020 edi- the PP of BB and TT test sets, separately for each tions organized a Shared Task on Parallel Corpus side. Figure 1 plots the curves; the straight lines on Filtering. Koehn et al. (2020) provide details on 4 github.com/facebookresearch/LASER 5 the task proposed in the more recent edition, on opus.nlpl.eu the bottom correspond to the PP of the same test #segments #src w #trg w sets on LMs built on the in-domain training sets. it-en 32.0M 339M 352M Table 6: Stats of the parallel generic training sets. lation into Italian of the 31k English segments of the training set (Table 1) was performed by an in-house generic en-it MT engine (details in Ap- pendix A.1 of (Bentivogli et al., 2021)). Row BT of Table 5 shows the statistics of this artifi- cial bilingual corpus; similarly to what happened with the filtering process, the numbers of Italian and English words are much more compatible than they are in the original version of the corpus. Figure 1: Perplexity of test sets on LMs estimated on increasing amounts of selected data. 4 Experimental Results The form of curves is convex, as usual in data 4.1 MT Engine selection. In our case, the best trade-off between the pertinence of data and its amount occur when The MT engine is built on the ModernMT something more than a million words is selected; framework6 which implements the Trans- therefore, we decided to mine from OPUS the former (Vaswani et al., 2017) architecture. The bilingual text whose size is given in row DS of original generic model is Big sized, as defined Table 5. Anyway, note that the lowest PP for se- in (Vaswani et al., 2017) by more than 200 lections is at least one order of magnitude greater million parameters. For training, bi-texts were than on LMs trained on in-domain training sets. downloaded from the OPUS repository5 and then filtered through the already mentioned data task set #seg #src w #trg w selection method (Axelrod et al., 2011) using a DS 206,990 1,352,623 1,312,068 general-domain seed. 
3.3 Data Augmentation

Since the corpus filtering discussed in the previous section removes most of the original data, further exacerbating the problem of data scarcity, we tried to overcome this unwanted side effect by means of data augmentation methods.

3.3.1 Data Selection

A widely adopted data augmentation method is data selection. Data selection assumes the availability of a large general-domain corpus and a small in-domain corpus; in MT, the aim is to extract from the large bilingual corpus the parallel sentences that are most relevant to the target domain as defined by the small corpus.

On the basis of the bilingual cross-entropy difference (Axelrod et al., 2011), we sorted the sentence pairs of the OPUS collection (opus.nlpl.eu), used as the large general-domain dataset, according to their relevance to the domain determined by the concatenation of the BB and TT training sets. To establish the optimal size of the selection, we trained LMs - created in the same setup described in the non-homogeneity of data paragraph of Section 2.2 - on increasing amounts of selected data and computed the PP of the BB and TT test sets, separately for each side. Figure 1 plots the curves; the straight lines at the bottom correspond to the PP of the same test sets on LMs built on the in-domain training sets.

Figure 1: Perplexity of test sets on LMs estimated on increasing amounts of selected data. (plot omitted)

The form of the curves is convex, as usual in data selection. In our case, the best trade-off between the pertinence of the data and its amount occurs when slightly more than a million words are selected; we therefore decided to mine from OPUS the bilingual text whose size is given in row DS of Table 5. Note, however, that the lowest PP of the selections remains at least one order of magnitude higher than that measured on LMs trained on the in-domain training sets.
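The ranking criterion can be sketched as follows, reusing NLTK LMs as in the perplexity sketch above. The four models (in-domain and general-domain, one per language side) are assumed to have been trained already with train_lm(); lower scores mean "more in-domain".

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

def cross_entropy(lm, tokens, order=3):
    grams = list(ngrams(pad_both_ends(tokens, n=order), order))
    # logscore is log2, so this is the per-ngram cross-entropy in bits
    return -sum(lm.logscore(g[-1], g[:-1]) for g in grams) / len(grams)

def bilingual_ced(src, trg, lm_in_src, lm_gen_src, lm_in_trg, lm_gen_trg):
    """Bilingual cross-entropy difference (Axelrod et al., 2011)."""
    return (cross_entropy(lm_in_src, src) - cross_entropy(lm_gen_src, src)
            + cross_entropy(lm_in_trg, trg) - cross_entropy(lm_gen_trg, trg))

def select(pairs, lms, n_words=1_000_000):
    """Rank tokenized (src, trg) pairs; keep the most relevant ~n_words."""
    ranked = sorted(pairs, key=lambda p: bilingual_ced(p[0], p[1], *lms))
    kept, count = [], 0
    for src, trg in ranked:
        kept.append((src, trg))
        count += len(src)
        if count >= n_words:
            break
    return kept
```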
3.3.2 Back Translation

Another well-known data augmentation method, which in a way also represents an alternative to corpus filtering for dealing with the BB English side issue, is back-translation (Bertoldi and Federico, 2009; Sennrich et al., 2016; Edunov et al., 2018). It assumes the availability of an MT system from the target language to the source language and of target-side monolingual data. The MT system is used to translate the target monolingual data into the source language. The result is a parallel corpus where the source side is synthetic MT output while the target side is human text. The synthetic parallel corpus is then used to train or adapt a source-to-target MT system. Although simple, this method has been shown to be very effective.

We used back-translation to generate a synthetic, but hopefully cleaner, version of the BB training set. The translation into Italian of the 31k English segments of the training set (Table 1) was performed by an in-house generic en-it MT engine (details in Appendix A.1 of Bentivogli et al. (2021)). Row BT of Table 5 shows the statistics of this artificial bilingual corpus; similarly to what happened with the filtering process, the numbers of Italian and English words are much better balanced than in the original version of the corpus.

set  #seg     #src w     #trg w
DS   206,990  1,352,623  1,312,068
BT   30,957   482,398    505,688

Table 5: Stats of selected and back-translated data.
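A sketch of this step follows: the clean English side is machine-translated into Italian to obtain a synthetic it-en corpus with human English targets. A public Marian en-it model stands in here for the in-house generic engine actually used in the paper.

```python
from transformers import pipeline

en2it = pipeline("translation", model="Helsinki-NLP/opus-mt-en-it")

def back_translate(en_segments, batch_size=32):
    outputs = en2it(en_segments, batch_size=batch_size)
    # synthetic Italian source paired with the original, human English target
    return [(o["translation_text"], en) for o, en in zip(outputs, en_segments)]
```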
4 Experimental Results

4.1 MT Engine

The MT engine is built on the ModernMT framework (github.com/modernmt/modernmt), which implements the Transformer architecture (Vaswani et al., 2017). The original generic model is Big sized, as defined in Vaswani et al. (2017), with more than 200 million parameters. For training, bi-texts were downloaded from the OPUS repository and then filtered through the already mentioned data selection method (Axelrod et al., 2011) using a general-domain seed. Statistics of the resulting it-en corpus are provided in Table 6. Training was performed in the setup detailed in Bentivogli et al. (2021).

       #segments  #src w  #trg w
it-en  32.0M      339M    352M

Table 6: Stats of the parallel generic training sets.

The same Big model and its smaller variants, the Base with 50 million parameters and the Tiny with 20 million parameters, were also trained on in-domain data only, for the sake of comparison.

4.2 MT Models

We empirically compared the quality of the translations generated by various MT models: two generic models, three genuine in-domain models of different sizes, and several variants of our generic model adapted (Section 3.1) on the in-domain data resulting from the presented methods: filtering (Section 3.2), data selection (Section 3.3.1) and back-translation (Section 3.3.2). Performance was measured on the BB and TT test sets in terms of BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and CHRF (Popović, 2015) scores computed by means of SacreBLEU v1.4.14 (Post, 2018) with default signatures (BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a, TER+tok.tercom-nonorm-punct-noasian-uncased, chrF2+numchars.6+space.false).
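The scoring step reduces to a few SacreBLEU calls; a minimal sketch with a single reference per segment is shown below. Recent sacrebleu versions expose these compat calls (the paper pins v1.4.14, whose chrF call signature differs slightly).

```python
import sacrebleu

def evaluate(hypotheses, references):
    """hypotheses, references: lists of strings, one segment each."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    ter = sacrebleu.corpus_ter(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"BLEU": bleu.score, "TER": ter.score, "chrF": chrf.score}
```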
4.3 Results and Comments

Scores are collected in Table 7.

                           BB noisy test set          BB clean test set           TT test set
MT model                   %BLEU↑ %TER↓  CHRF↑        %BLEU↑ %TER↓  CHRF↑         %BLEU↑ %TER↓  CHRF↑

Generic models:
GT (*)                     11.45  106.61 .3502        32.59  51.72  .6104         32.20  61.56  .6315
FBK (Transformer big)       7.43  113.07 .3833        19.68  63.68  .5229         23.45  70.46  .5525

Pure in-domain models trained on BBtrn-nsy+TTtrn:
Transformer tiny           23.34   83.86 .4882        35.80  61.05  .5808         42.19  51.79  .6488
Transformer base           18.39   93.41 .4590        22.06  85.91  .5237         29.17  64.73  .5351
Transformer big            20.45   95.76 .4755        24.73  89.26  .5330         28.01  68.42  .5193

FBK model adapted on:
BBtrn-nsy                  21.21(1) 80.82(2) .4785(2) 37.91(3) 46.91(3) .6172     13.77  79.14  .4007
BBtrn-cln                  10.67  108.86 .4195        31.57  52.54  .5950         27.68  65.05  .5912
TTtrn                      10.44  107.48 .4241        28.64  54.20  .5800         39.61  52.64  .6702
DS                         10.82  109.71 .4255        30.11  54.86  .5873         29.76  63.68  .6099
BT                         12.50  106.85 .4507        34.85  49.78  .6339         32.71  58.95  .6372
BBtrn-nsy+TTtrn            19.30(3) 79.29(1) .4449    32.81  52.38  .5680         40.51(3) 51.97(3) .6579
BBtrn-nsy+TTtrn+DS+BT      19.36(2) 86.33(3) .4792(1) 41.17(1) 44.67(1) .6488(2)  40.69(2) 51.84(2) .6734(3)
BBtrn-cln+TTtrn            12.39  105.36 .4450        37.02  47.40  .6365(3)      40.34  52.16  .6755(2)
BBtrn-cln+TTtrn+DS+BT      13.75  104.59 .4619(3)     40.09(2) 45.28(2) .6617(1)  41.16(1) 51.01(1) .6803(1)

Table 7: BLEU/TER/CHRF scores of MT models on the it-en test sets. (1), (2) and (3) indicate the "podium position" among the adapted models in each column. (*) Google Translate, as of 14 Sep 2021.

First, as expected (see the in-domain data scarcity paragraph of Section 2.2), it is not feasible to properly train a huge number of parameters with little data; in fact, the best performing pure in-domain model is the smallest one (Transformer tiny). The naive application of the MT state of the art would instead have led to simply training a Transformer big model on the original in-domain data. This model would not have been competitive with GT on TT data (28.01 vs. 32.20 BLEU); it would have been on BB data if we had only considered the noisy test set (20.45 vs. 11.45), resulting in a serious misinterpretation of the actual quality of the two systems; conversely, our preliminary analysis allowed us to discover the need to clean the BB data, which guarantees a reliable assessment (24.73 vs. 32.59).

Data augmentation methods (DS, BT) are both effective in making additional useful bi-texts available; for example, the BLEU score of the BBtrn-cln+TTtrn model on the BB clean test set increases by 3 absolute points (from 37.02 to 40.09) when DS and BT data are added to the adaptation corpus.

The fine-tuning of a generic Transformer big model to the weather forecast domain turned out to be more effective than any training from scratch on original in-domain data only: the top-performing model - BBtrn-cln+TTtrn+DS+BT - definitely improves over the Transformer tiny with respect to all metrics on the BB clean test set (40.09/45.28/.6617 vs. 35.80/61.05/.5808), and on two metrics out of three on the TT test set (TER: 51.01 vs. 51.79, CHRF: .6803 vs. .6488). Moreover, all its scores are far better than those of Google Translate.

4.4 Examples

To give a grasp of the actual quality of automatic translations, Table 8 collects the English text generated by some of the tested MT models fed with a rather complex Italian source sentence.

Italian source sentence:
  Le correnti in quota si disporranno da sudovest avvicinando masse d'aria più umida alle Alpi.

Manual English translations found in BB bulletins:
  Weak high pressure conditions.
  The high currents will turn to south-west and humid air mass will reach the Alps.
  Southwesterly currents will bring humid air masses to South Tyrol.
  South-western currents will bring humid air masses to the Alps.
  South-westerly upper level flow will bring humid air masses towards our region.
  More humid air masses will reach the Alps.
  Humid air reaches the Alps with South-westerly winds.

Automatic English translations generated by some MT models:
  GT: The currents at high altitudes will arrange themselves from the southwest, bringing more humid air masses closer to the Alps.
  FBK: Currents in altitude will be deployed from the southwest, bringing wet air masses closer to the Alps.
  Transformer tiny: South-westerly upper level flow will bring humid air masses towards the Alps.
  BBtrn-cln+TTtrn+DS+BT: The upper level flow will be arranged from the southwest approaching more humid air masses to the Alps.

Table 8: Examples of manual and automatic translations.

The manual translations observed in the BB data are shown as well: their number, their variety, some questionable or wrong lexical choices ("high" instead of "upper-level currents", "South-western" instead of "Southwesterly"), and one totally wrong translation ("Weak high pressure conditions.") prove the difficulty of learning from such data and the need to pay particular attention to the evaluation phase. Concerning the automatic translations, GT is able to keep most of the meaning of the source text, but the translation is too literal to result in fluent English. FBK only partially transfers the meaning from the source and generates rather bad English text. Transformer tiny provides a very good translation from both a semantic and a syntactic point of view, losing only the negligible detail that the "air masses" are "more humid", not simply "humid". Finally, BBtrn-cln+TTtrn+DS+BT, the model that is the best one on the basis of our evaluations, works very well on this specific example at the semantic level but rather poorly at the grammatical level.

This example shows that pure in-domain models, as expected, are "more in-domain" than generic models, even adapted ones, showing greater adherence to domain-specific language. On the other hand, according to the scores in Table 7, adapted models should generalize better. Only subjective evaluations involving meteorologists can settle the question of which model is the best.

5 Conclusions

In this paper we described the development process that led us to build competitive customized translation models. Given the provided in-domain data, we started by analyzing them from several perspectives and discovered that they are few, noisy and heterogeneous. We faced these issues by exploiting a number of methods which represent established knowledge of the scientific community: adaptation of neural models, corpus filtering, and data augmentation techniques such as data selection and back-translation. In particular, corpus filtering allowed us to avoid the misleading results observed on the original noisy data, while adaptation and data augmentation proved useful in effectively taking advantage of out-of-domain resources.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain Adaptation via Pseudo In-Domain Data Selection. In Proc. of EMNLP, pages 355-362, Edinburgh, Scotland, UK.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference? In Proc. of ACL/IJCNLP (Volume 1: Long Papers), pages 2873-2887, Bangkok, Thailand.

Nicola Bertoldi and Marcello Federico. 2009. Domain Adaptation for Statistical Machine Translation with Monolingual Resources. In Proc. of WMT, pages 182-189, Athens, Greece.

Chenhui Chu and Rui Wang. 2018. A Survey of Domain Adaptation for Neural Machine Translation. In Proc. of COLING, pages 1304-1319, Santa Fe, US-NM.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding Back-Translation at Scale. In Proc. of EMNLP, pages 489-500, Brussels, Belgium.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: An Open Source Toolkit for Handling Large Scale Language Models. In Proc. of Interspeech, pages 1618-1621, Brisbane, Australia.

Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment. In Proc. of WMT, pages 726-742, Online.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proc. of IWSLT, pages 76-79, Da Nang, Vietnam.

Antonio Valerio Miceli Barone, Barry Haddow, Ulrich Germann, and Rico Sennrich. 2017. Regularization Techniques for Fine-tuning in Neural Machine Translation. In Proc. of EMNLP, pages 1489-1494, Copenhagen, Denmark.

Robert C. Moore and William Lewis. 2010. Intelligent Selection of Language Model Training Data. In Proc. of ACL (Short Papers), pages 220-224, Uppsala, Sweden.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of ACL, pages 311-318, Philadelphia, US-PA.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proc. of WMT, pages 392-395, Lisbon, Portugal.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proc. of WMT, pages 186-191, Brussels, Belgium.

Holger Schwenk and Matthijs Douze. 2017. Learning Joint Multilingual Sentence Representations with Neural Machine Translation. In Proc. of RepL4NLP, pages 157-167, Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proc. of ACL (Volume 1: Long Papers), pages 86-96, Berlin, Germany.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proc. of AMTA, pages 223-231, Cambridge, US-MA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56):1929-1958.

Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. Parallel Corpora for Medium Density Languages. In Proc. of RANLP, pages 590-596, Borovets, Bulgaria.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proc. of NIPS, pages 5998-6008, Long Beach, US-CA.