MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge Mauro Cettolo1,*,† , Andrea Piergentili1,2,† , Sara Papi1 , Marco Gaido1 , Matteo Negri1 and Luisa Bentivogli1 1 Fondazione Bruno Kessler, Trento, Italy 2 University of Trento, Italy Abstract We propose MAGNET - MAchines GeNErating Translations, a CALAMITA Challenge which aims at testing the ability of large language models (LLMs) in the hot topic of automatic translation, focusing on Italian and English (in both directions) to overcome the marginality with which Italian is considered by the machine translation community. We propose a benchmark composed of two portions with different distribution policies (one free to use, the other not discloseable), allowing to handle data contamination issues. The publicly available section of the benchmark is distributed on Hugging Face, whereas in this report we describe the details of our challenge, including the prompt formats to be used. Additionally, we report the performance of five models, including a LLM and different sized translation models, in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations. Keywords Machine translation, English-Italian, FLORES+, Bleu, ChrF, Bleurt, Comet, Llama3-8B-Instruct, mBART50, NLLB 1. Introduction and Motivation world, including Italian. On the other hand, the global MT market size was valued at USD 847.24 million in 2021 and is Machine Translation (MT) refers to the process, carried out expected to expand at a compound annual growth rate of by a computer program, of translating text from one lan- 16.4% in 2024-2031, reaching USD 2107.56 million by 2027.2 guage to another without human involvement. The idea of Being Europe, and then Italy, one of the leading regions for using digital computers to translate natural languages dates the MT market, CALAMITA [6] cannot miss MT. Therefore back to the 1940s, making MT one of the oldest fields of artifi- we propose the challenge of testing the LLMs ability in the cial intelligence. Since then, the improvement in translation hot topic of automatic translation, focusing on Italian and quality has been constant and achieved through increasingly English (in both directions) to overcome the marginality effective approaches (rule-, example- and statistical-based); with which Italian is considered by the MT community. however, the most significant advances have likely been observed over the last few years, thanks to the introduction of neural networks. Neural models specifically trained for 2. Challenge: Description accomplishing the translation task, like DeepL Translator,1 reach outstanding quality, even if the so-called human par- The MAGNET challenge provides a framework for assessing ity has not been achieved yet, especially in unrestricted the ability of LLMs in translating Italian text into English and domains and for language pairs not involving English. Re- vice-versa. It is organized following the blueprint of other cently, an alternative neural-based method is gathering a long-standing MT shared tasks, such as those proposed lot of interest due to its undoubted potential; it consists in in the WMT3 and IWSLT4 conferences, where Organizers prompting generative large language models (LLMs), like prepare and distribute development and test sets, define the GPT models [1, 2] and the LLama model family [3, 4, 5], to training conditions, possibly providing specific training data, translate a text. Whatever the approach, the MT research establish the evaluation modalities, typically via automatic community is much focused on the development and vali- metrics and occasionally enriched by human evaluations, dation of models covering English and few other languages, collect and evaluate participants’ submissions, and finally paying little attention or completely neglecting the vast disclose the results. majority of the more than 7,000 languages spoken in the The MAGNET challenge supplies a benchmark divided in two portions: one based on a publicly available MT bench- CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec mark and a private one (see Section 3). This allows par- 04 — 06, 2024, Pisa, Italy * Corresponding author. ticipants not only to evaluate their models but possibly to † These authors contributed equally. also fine-tune them, by exploiting the open portion of the $ cettolo@fbk.eu (M. Cettolo); apiergentili@fbk.eu (A. Piergentili); MAGNET benchmark for development purposes. spapi@fbk.eu (S. Papi); mgaido@fbk.eu (M. Gaido); negri@fbk.eu Multiple evaluation metrics are employed so as to have a (M. Negri); bentivo@fbk.eu (L. Bentivogli) comprehensive overview of the quality of the translations € https://mt.fbk.eu/author/cettolo/ (M. Cettolo); generated by a specific model. Indeed, shared tasks on au- https://mt.fbk.eu/author/apiergentili/ (A. Piergentili); tomatic metrics are still being organized,5 as evidence of https://mt.fbk.eu/author/spapi/ (S. Papi); https://mt.fbk.eu/author/mgaido/ (M. Gaido); the fact that none of the metrics designed up to now by the https://mt.fbk.eu/author/negri/ (M. Negri); scientific community has proven capable of covering every https://mt.fbk.eu/author/bentivogli/ (L. Bentivogli) single aspect that defines a “good” translation by itself .  0000-0001-8388-497X (M. Cettolo); 0000-0002-4494-8886 (A. Piergentili); 0000-0002-4494-8886 (S. Papi); 0000-0003-4217-1396 2 https://www.linkedin.com/pulse/machine-translation-mt-market- size- (M. Gaido); 0000-0002-8811-4330 (M. Negri); 0000-0001-7480-2231 2024-suhoe/ (L. Bentivogli) 3 https://www2.statmt.org/wmt24/translation-task.html © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License 4 Attribution 4.0 International (CC BY 4.0). https://iwslt.org/2024/#shared-tasks 1 5 https://en.wikipedia.org/wiki/DeepL_Translator https://www2.statmt.org/wmt24/metrics-task.html CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings In addition, in order to allow for comparisons, scores mea- to the CLOSED portion is only provided to the Organizers sured on the translation generated by Llama3-8B-Instruct of the task. and a number of other models are made available (see Sec- tion 4). 3.2. Prompts Table 1 reports the simple prompt formats we propose. Both 3. Data description contain a simple translation instruction first, followed by the source sentence, and then the target language translation in We test LLMs’ ability to translate between Italian and En- a new line. We include four iterations of this format in the glish using a parallel corpus composed of two parts: an actual prompts before appending the input, so as to activate OPEN portion and a CLOSED one. LLMs’ in-context learning ability [1]. Both the source and the translation are surrounded by OPEN For the OPEN portion of the MAGNET benchmark the characters < and >. This instructs the model to repro- we propose FLORES+, the latest version of FLORES-2006 [7], duce this format in its output as well. We do so to address a multilingual MT evaluation benchmark released under CC LLMs’ tendency to include unwanted extra comments in BY-SA 4.0 by FAIR researchers at Meta. It consists of English their outputs. Such comments would compromise all au- sentences sampled in equal amounts from Wikinews (an tomatic evaluations (see Section 4) due to the presence of international news source), Wikijunior (a collection of age- extra content in the candidate outputs, which is penalized appropriate non-fiction books), and Wikivoyage (a travel by the string-based metrics and alters the vector representa- guide), translated into more than 200 languages, including tions used by the model-based metrics to compute similarity Italian. Dev and devtest sets consisting of about 1,000 seg- scores. ments each are provided. See Section 3.3 for statistics on this portion of the MAGNET benchmark. 3.3. Detailed data statistics CLOSED The CLOSED subset is a MT test set developed In Table 2 detailed statistics are provided on the various by FBK by collecting texts of English and Italian news, and sections of the benchmark in terms of number of segments then commissioning their professional translation to a spe- (#seg), and of English (|en|) and Italian (|it|) words. cialized company. This resource is private and not publicly accessible. See Section 3.3 for statistics on this portion of the MAGNET benchmark. 4. Metrics We evaluate LLMs’ performance in translation using a set Both subsets allow for the evaluation of MT quality in of four automatic metrics selected in light of the ongoing both translation directions, i.e. English→Italian and Ital- challenges in MT evaluation, which still pose an open prob- ian→English. The decision to split our benchmark in two lem. New metrics are indeed continually proposed, and subsets is primarily motivated by their current distribution evaluation campaigns aimed at assessing these metrics are policy, which is inherently linked to growing concerns about organised periodically (for example, the annual WMT Met- data contamination [8]. Data contamination refers to the rics Shared Task [9]). Broadly, automatic metrics can be possibility that the input-output pairs used in LLM tests divided into string-based metrics and metrics using pre- occur in the huge data sets typically used for pre-training trained models, with either group having both strengths and fine-tuning; such overlap can lead to inflated bench- and weaknesses [10]. Therefore, for a more comprehensive mark scores, creating an overly favorable impression of an translation quality evaluation accounting for their comple- LLM’s abilities. Although it is challenging to determine with mentarity, we propose to adopt a couple of metrics from certainty whether the models being evaluated were trained each group, selected among the most commonly used ones: on popular datasets scraped from the web, this possibility should be taken seriously. To promote sound evaluation • string-based: BLEU8 [11] and CHRF9 [12] via and mitigate the effects of biased or potentially mislead- sacreBLEU [13] ing results due to data contamination, one approach is to • pretrained models-based: BLEURT [14] (check- rely exclusively on – or at least include among the bench- point: BLEURT-20) and COMET [15] (model: marks – “safe” datasets that are either private or have very wmt22-comet-da). controlled/limited distribution. Therefore, pairing a larger, widely used public dataset (FLORES+) with a smaller, in- All of them are quality metrics, that is the higher the house dataset – the CLOSED subset – aims to strike a balance score the better the translation. The overview of the scores between the thoroughness and the reliability of the evalua- from all these metrics allows for a robust assessment of the tion. quality of individual models, and a fair comparison between different models as well. 3.1. Data format We provide reference performance on our challenge of one of the most popular open LLMs, and four state-of-the- The datasets are organized in a parallel text format, i.e. ev- art MT models: ery entry is composed of a sentence in one language and the corresponding translation. The OPEN portion of the bench- mark is publicly available on Hugging Face,7 whereas access 8 sacreBLEU signature: nrefs:1|case:mixed| 6 https://github.com/openlanguagedata/flores |eff:no|tok:13a|smooth:exp|version:2.0.0 7 9 https://huggingface.co/datasets/FBK-MT/ sacreBLEU signature: nrefs:1|case:mixed| MAGNETbenchmark4CALAMITA24 |eff:yes|nc:6|nw:0|space:no|version:2.0.0 prompt content Translate the following sentence into Italian: en-it Translate the following sentence into English: Table 1 Examples of the format of prompts proposed for MT Challenge. Prompt en-it is designed for the translation from English into Italian, prompt it-en for the opposite direction. In both cases, for instructing Llama3-8B-Instruct only one single shot taken from the OPEN dev set is shown, while in experiments of Section 4 four shots are provided to the model. Data Set #seg |en| |it| different NLLB models are available under the CC-BY-NC dev 997 21.0k 23.0k OPEN 4.0 license, which mainly differ in size, ranging from the devtst 1012 21.9k 24.3k smallest with 600M parameters to the largest with 54.5B UK 589 10.6k 11.2k CLOSED US 599 10.0k 9.7k parameters. On the basis of their manageability and official IT 547 10.8k 10.3k performance claimed by the authors, we decided to include two NLLB models in this investigation, the distilled variant Table 2 with 1.3B parameters (NLLB_1.3B) and the one with 3.3B Statistics of the benchmark in terms of number of segments and parameters (NLLB_3.3B). of (detokenized) words on English and Italian sides. Table 3 provides the scores measured for each model on all evaluation sets of the benchmark, except for the OPEN dev set, since we reserved that subset as the source of the Llama-3-8B-Instruct:10 a LLM from the Llama 3 model exemplars used for few-shot prompting with Llama-3-8B- family [5]. It is an instruction-tuned model, i.e. it is fine- Instruct. First of all, we note that the performance of the tuned to align its outputs with the desired response charac- three multilingual translation models mBART50, NLLB_1.3B teristics [16], in this case for assistant-like chat. Therefore, and NLLB_3.3B are strictly in increasing order according we provide the 4-shot prompts described in Section 3.2 as to their number of parameters, with respect to all metrics input for the model in a chat format, with user role mes- (with only one microscopic exception). In general, Llama-3- sages with the instruction and the input and assistant role 8B-Instruct performs better than mBART50 and worse than messages with the corresponding output.11 NLLB_1.3B. HelsinkiMT:12 the Language Technology Research The behavior of HelsinkiMT is more difficult to frame: Group at the University of Helsinki made available under there are cases in which it is definitely the best perform- the CC-BY-4.0 license a set of neural MT models trained with ing model (CLOSED-IT, it→en) or at least competitive with MarianNMT13 on OPUS data,14 including English-Italian15 NLLB_3.3B (CLOSED-UK, en→it; CLOSED-IT, en→it); oth- and Italian-English16 models. ers in which it is only slightly better than mBART50 (OPEN mBART50:17 a multilingual neural translation model devtst, it→en; CLOSED-US, it→en). This can probably be that covers any pair from a set of 50 languages, English and explained by the fact that HelsinkiMT is not a single model, Italian included [17]. Built by Meta/Facebook on the fairseq rather a collection of models specifically trained for cov- toolkit,18 it is released under the MIT license. Its network ering the translation between specific languages. That is, has approximately 600M parameters. HelsinkiMT en→it and it→en models were trained indepen- NLLB:19 No Language Left Behind (NLLB) is also a mul- dently, on different training data. Therefore, it is possible tilingual neural translation model that covers any pair from that their performance when compared to that of other mod- more than 200 languages, including the two we are inter- els may not be consistent across the various sections of our ested in. The code was developed by Meta/Facebook as a benchmark. branch of fairseq and is released under the MIT license. Five In summary, we can state that Llama-3-8B-Instruct, a general purpose, generative model only conditioned towards 10 https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct performing translation by four task exemplars, compares 11 https://huggingface.co/docs/transformers/main/en/chat_templating well to translation models; likely, fine-tuning Llama-3-8B- 12 https://github.com/Helsinki-NLP/Opus-MT Instruct on the translation task could allow it to achieve 13 https://marian-nmt.github.io/ even better performance. However, it should be considered 14 https://opus.nlpl.eu/ 15 https://huggingface.co/Helsinki-NLP/opus-mt-en-it that this version of Llama-3-8B-Instruct – which is also the 16 https://huggingface.co/Helsinki-NLP/opus-mt-it-en smallest of that model family – has 8B parameters, more 17 https://huggingface.co/facebook/mbart-large-50 than twice the parameters of NLLB_3.3B and an order of 18 https://github.com/facebookresearch/fairseq magnitude more than mBART50. 19 https://github.com/facebookresearch/fairseq/tree/nllb it→en en→it system BLEU ChrF BLEURT COMET BLEU ChrF BLEURT COMET OPEN – devtst HelsinkiMT 29.39 60.00 0.7568 0.8656 27.53 57.61 0.7422 0.8521 mBART50 27.34 57.64 0.7371 0.8494 23.88 54.34 0.7322 0.8502 NLLB_1.3B 35.08 62.42 0.7732 0.8774 29.31 58.04 0.7773 0.8749 NLLB_3.3B 35.03 63.04 0.7781 0.8805 29.95 58.74 0.7871 0.8811 Llama-3-8B-Instruct 32.04 62.03 0.7778 0.8795 26.36 56.60 0.7710 0.8758 CLOSED – UK HelsinkiMT 48.06 71.78 0.8038 0.8949 57.35 76.99 0.7998 0.8836 mBART50 43.77 68.79 0.7789 0.8776 47.46 70.68 0.7910 0.8837 NLLB_1.3B 52.48 73.83 0.8072 0.8954 55.12 74.62 0.8160 0.8933 NLLB_3.3B 54.61 75.09 0.8096 0.8968 56.00 75.28 0.8210 0.8937 Llama-3-8B-Instruct 46.61 71.02 0.8088 0.8985 39.29 66.50 0.7948 0.8840 CLOSED – US HelsinkiMT 39.26 62.25 0.7459 0.8571 39.02 64.41 0.7395 0.8394 mBART50 37.54 60.78 0.7314 0.8437 34.19 60.79 0.7309 0.8420 NLLB_1.3B 42.72 64.76 0.7449 0.8544 39.91 64.40 0.7580 0.8566 NLLB_3.3B 43.36 65.23 0.7483 0.8585 40.35 64.63 0.7681 0.8583 Llama-3-8B-Instruct 39.08 62.53 0.7502 0.8613 28.73 58.24 0.7355 0.8469 CLOSED – IT HelsinkiMT 59.14 77.83 0.7814 0.8515 48.90 74.47 0.8278 0.8898 mBART50 39.00 63.98 0.7101 0.8029 37.24 66.65 0.7858 0.8679 NLLB_1.3B 49.17 69.88 0.7361 0.8251 46.48 72.32 0.8212 0.8896 NLLB_3.3B 50.33 70.67 0.7373 0.8271 47.67 73.56 0.8285 0.8928 Llama-3-8B-Instruct 43.89 68.96 0.7660 0.8496 37.19 67.64 0.7996 0.8797 Table 3 Translation results on benchmark of MT models and LLMs. The best scores for each translation direction, subset, and metric are signalled in bold. 5. Limitations 7. Data license and copyright issues Nowadays, LLMs are trained on huge amounts of data The OPEN section of our benchmark is part of the FLO- mostly crawled from the web. Therefore, as already pointed RES+ dataset which is licensed under the Creative Commons out in Section 3, it is hard to be sure that there is no data Attribution Share Alike 4.0 International,20 which requires contamination, that is no overlap between training and eval- derivatives to be distributed under the same or a similar, uation data. Data contamination makes the evaluation of compatible license. We opted for the same license. LLMs unreliable since their performance may be inflated. There is no license associated with the CLOSED part of Concerning our specific case, the risk that OPEN/FLORES+ our benchmark as it is not distributed and can only be used data are contaminated is not negligible; however the results by CALAMITA Organizers for evaluation purposes. shown in Table 3, which are good but realistic, do not seem to indicate any contamination. In theory, the contamination risk of the CLOSED section is Acknowledgments lower than for the CLOSED one, since the translations of the The work presented in this paper is funded by the Euro- original texts have never been released. On the other hand, pean Union’s Horizon research and innovation programme original texts are available on the web (although only for under grant agreement No 101135798, project Meetween private use), therefore it cannot be ruled out that the models (My Personal AI Mediator for Virtual MEETtings BetWEEN “know” them, in some way. For example, the exceptionally People) and the PNRR project FAIR - Future AI Research high results of HelsinkiMT on the CLOSED-IT set seem to (PE00000013), under the NRRP MUR program funded by the be an anomaly, likely due to data contamination. NextGenerationEU. 6. Ethical issues References Our proposal does not focus on ethically charged topics. [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. While the data we propose for the evaluation of automatic Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, translation may mention sensitive topics or be afflicted by G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, ethical issues such as social biases (e.g., gender bias), here we G. Krueger, T. Henighan, R. Child, A. Ramesh, focus solely on MT quality evaluation and leave the investi- D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, gation of ethical aspects to other resources and analyses. E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing 20 https://github.com/openlanguagedata/flores/blob/main/LICENSE Systems, volume 33, 2020, pp. 1877–1901. URL: https: URL: https://www.aclweb.org/anthology/W18-6319. //proceedings.neurips.cc/paper_files/paper/2020/file/ [14] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. metrics for text generation, in: Proc. of ACL, Online, [2] OpenAI, Gpt-4 technical report, 2024. URL: https:// 2020, pp. 7881–7892. URL: https://aclanthology.org/ arxiv.org/abs/2303.08774. arXiv:2303.08774. 2020.acl-main.704. [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. [15] R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, Farinha, T. Glushkova, A. Lavie, L. Coheur, A. F. T. F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lam- Martins, COMET-22: Unbabel-IST 2022 submission ple, Llama: Open and efficient foundation language for the metrics shared task, in: Proc. of WMT, Abu models, 2023. URL: https://arxiv.org/abs/2302.13971. Dhabi, United Arab Emirates (Hybrid), 2022, pp. 578– arXiv:2302.13971. 585. URL: https://aclanthology.org/2022.wmt-1.52. [4] H. Touvron, et al., Llama 2: Open foundation and fine- [16] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, tuned chat models, 2023. URL: https://arxiv.org/abs/ R. Hu, T. Zhang, F. Wu, G. Wang, Instruction tuning 2307.09288. arXiv:2307.09288. for large language models: A survey, 2024. URL: https: [5] A. Dubey, et al., The Llama 3 herd of mod- //arxiv.org/abs/2308.10792. arXiv:2308.10792. els, 2024. URL: https://arxiv.org/abs/2407.21783. [17] Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, arXiv:2407.21783. V. Chaudhary, J. Gu, A. Fan, Multilingual transla- [6] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, tion with extensible multilingual pretraining and fine- J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, tuning, 2020. URL: https://arxiv.org/abs/2008.00401. D. Scalena, CALAMITA: Challenge the Abilities of arXiv:2008.00401. LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024. [7] NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wen- zek, A. Youngblood, B. Akula, L. Barrault, G. Mejia- Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, J. Wang, No lan- guage left behind: Scaling human-centered machine translation, 2022. arXiv:arXiv:1902.01382. [8] C. Deng, Y. Zhao, X. Tang, M. Gerstein, A. Cohan, Investigating data contamination in modern bench- marks for large language models, in: Proc. of NAACL (Volume 1: Long Papers), Mexico City, Mexico, 2024, pp. 8706–8719. URL: https://aclanthology.org/2024. naacl-long.482. [9] M. Freitag, N. Mathur, C.-k. Lo, E. Avramidis, R. Rei, B. Thompson, T. Kocmi, F. Blain, D. Deutsch, C. Stew- art, C. Zerva, S. Castilho, A. Lavie, G. Foster, Re- sults of WMT23 metrics shared task: Metrics might be guilty but references are not innocent, in: Proc. of WMT, Singapore, 2023, pp. 578–628. URL: https: //aclanthology.org/2023.wmt-1.51. [10] T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys- Dowmunt, H. Matsushita, A. Menezes, To ship or not to ship: An extensive evaluation of automatic metrics for machine translation, in: Proc. of WMT, Online, 2021, pp. 478–494. URL: https://aclanthology.org/2021. wmt-1.57. [11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a Method for Automatic Evaluation of Machine Trans- lation, in: Proc. of ACL, Philadelphia, USA, 2002, pp. 311–318. [12] M. Popovic, chrF: character n-gram F-score for au- tomatic MT evaluation, in: Proc. of WMT, Lisbon, Portugal, 2015, pp. 392–395. URL: https://aclanthology. org/W15-3049. [13] M. Post, A Call for Clarity in Reporting BLEU Scores, in: Proc. of WMT, Belgium, Brussels, 2018, pp. 186–191.