<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mauro</forename><surname>Cettolo</surname></persName>
							<email>cettolo@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Piergentili</surname></persName>
							<email>apiergentili@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Trento</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sara</forename><surname>Papi</surname></persName>
							<email>spapi@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Gaido</surname></persName>
							<email>mgaido@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Negri</surname></persName>
							<email>negri@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luisa</forename><surname>Bentivogli</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BCB59BD7DA50847F59D83B8DD93C5877</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine translation, English-Italian, FLORES+, Bleu, ChrF, Bleurt, Comet, Llama3-8B-Instruct, mBART50, NLLB</term>
					<term>0000-0001-8388-497X (M. Cettolo)</term>
					<term>0000-0002-4494-8886 (A. Piergentili)</term>
					<term>0000-0002-4494-8886 (S. Papi)</term>
					<term>0000-0003-4217-1396 (M. Gaido)</term>
					<term>0000-0002-8811-4330 (M. Negri)</term>
					<term>0000-0001-7480-2231 (L. Bentivogli)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We propose MAGNET - MAchines GeNErating Translations, a CALAMITA Challenge that tests the ability of large language models (LLMs) on the hot topic of automatic translation, focusing on Italian and English (in both directions) to counter the marginal attention that Italian receives from the machine translation community. We propose a benchmark composed of two portions with different distribution policies (one free to use, the other not discloseable), which allows us to handle data contamination issues. The publicly available section of the benchmark is distributed on Hugging Face, whereas in this report we describe the details of our challenge, including the prompt formats to be used. Additionally, we report the performance of five models, including an LLM and translation models of different sizes, in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Motivation</head><p>Machine Translation (MT) refers to the process, carried out by a computer program, of translating text from one language to another without human involvement. The idea of using digital computers to translate natural languages dates back to the 1940s, making MT one of the oldest fields of artificial intelligence. Since then, translation quality has improved steadily through increasingly effective approaches (rule-based, example-based, and statistical); the most significant advances, however, have likely come in the last few years, thanks to the introduction of neural networks. Neural models specifically trained for the translation task, like DeepL Translator,<ref type="foot" target="#foot_0">1</ref> reach outstanding quality, even if so-called human parity has not yet been achieved, especially in unrestricted domains and for language pairs not involving English. Recently, an alternative neural-based method has been gathering considerable interest due to its undoubted potential: prompting generative large language models (LLMs), like the GPT models <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> and the Llama model family <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>, to translate a text. Whatever the approach, the MT research community focuses largely on developing and validating models covering English and a few other languages, paying little attention to, or completely neglecting, the vast majority of the more than 7,000 languages spoken in the world, including Italian. On the other hand, the global MT market was valued at USD 847.24 million in 2021 and is expected to expand at a compound annual growth rate of 16.4% in 2024-2031, reaching USD 2107.56 million by 2027.<ref type="foot">2</ref> Since Europe, and with it Italy, is one of the leading regions of the MT market, CALAMITA <ref type="bibr" target="#b5">[6]</ref> cannot overlook MT. We therefore propose the challenge of testing LLMs' ability on the hot topic of automatic translation, focusing on Italian and English (in both directions) to counter the marginal attention that Italian receives from the MT community.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>The MAGNET challenge provides a framework for assessing the ability of LLMs to translate Italian text into English and vice versa. It is organized following the blueprint of long-standing MT shared tasks, such as those proposed at the WMT 3 and IWSLT 4 conferences: the Organizers prepare and distribute development and test sets, define the training conditions (possibly providing specific training data), establish the evaluation modalities (typically automatic metrics, occasionally enriched by human evaluations), collect and evaluate participants' submissions, and finally disclose the results.</p><p>The MAGNET challenge supplies a benchmark divided into two portions: one based on a publicly available MT benchmark and a private one (see Section 3). This allows participants not only to evaluate their models but also, possibly, to fine-tune them, by exploiting the open portion of the MAGNET benchmark for development purposes.</p><p>Multiple evaluation metrics are employed so as to obtain a comprehensive overview of the quality of the translations generated by a given model. Indeed, shared tasks on automatic metrics are still being organized, 5 evidence that no metric designed so far by the scientific community has proven capable of covering, by itself, every aspect that defines a "good" translation.</p><p>In addition, to allow for comparisons, scores measured on the translations generated by Llama3-8B-Instruct and a number of other models are made available (see Section 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>We test LLMs' ability to translate between Italian and English using a parallel corpus composed of two parts: an OPEN portion and a CLOSED one.</p><p>OPEN For the OPEN portion of the MAGNET benchmark we propose FLORES+, the latest version of FLORES-200<ref type="foot" target="#foot_1">6</ref> <ref type="bibr" target="#b6">[7]</ref>, a multilingual MT evaluation benchmark released under CC BY-SA 4.0 by FAIR researchers at Meta. It consists of English sentences sampled in equal amounts from Wikinews (an international news source), Wikijunior (a collection of age-appropriate non-fiction books), and Wikivoyage (a travel guide), translated into more than 200 languages, including Italian. Dev and devtest sets, each consisting of about 1,000 segments, are provided. See Section 3.3 for statistics on this portion of the MAGNET benchmark.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CLOSED</head><p>The CLOSED subset is an MT test set developed by FBK by collecting English and Italian news texts and commissioning their professional translation to a specialized company. This resource is private and not publicly accessible. See Section 3.3 for statistics on this portion of the MAGNET benchmark.</p><p>Both subsets allow for the evaluation of MT quality in both translation directions, i.e. English→Italian and Italian→English. The decision to split our benchmark into two subsets is primarily motivated by their current distribution policies, which are inherently linked to growing concerns about data contamination <ref type="bibr" target="#b7">[8]</ref>. Data contamination refers to the possibility that the input-output pairs used in LLM tests occur in the huge datasets typically used for pre-training and fine-tuning; such overlap can lead to inflated benchmark scores, creating an overly favorable impression of an LLM's abilities. Although it is challenging to determine with certainty whether the models being evaluated were trained on popular datasets scraped from the web, this possibility should be taken seriously. To promote sound evaluation and mitigate the effects of biased or potentially misleading results due to data contamination, one approach is to rely exclusively on, or at least include among the benchmarks, "safe" datasets that are either private or have very controlled/limited distribution. Therefore, pairing a larger, widely used public dataset (FLORES+) with a smaller, in-house dataset (the CLOSED subset) aims to strike a balance between the thoroughness and the reliability of the evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data format</head><p>The datasets are organized in a parallel text format, i.e. every entry is composed of a sentence in one language and the corresponding translation. The OPEN portion of the benchmark is publicly available on Hugging Face, <ref type="foot" target="#foot_2">7</ref> whereas access to the CLOSED portion is only provided to the Organizers of the task.</p></div>
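As an illustration of this parallel text format, the following sketch pairs each source sentence with the target sentence at the same index. The helper name and the toy sentences are ours, not part of the benchmark, and the actual Hugging Face dataset layout may differ:

```python
from io import StringIO

def read_parallel(src_file, tgt_file):
    """Pair each source line with the target line at the same index.
    Raises if the two files are not aligned one sentence per line."""
    src_lines = [l.strip() for l in src_file if l.strip()]
    tgt_lines = [l.strip() for l in tgt_file if l.strip()]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files are not aligned")
    return list(zip(src_lines, tgt_lines))

# Toy aligned data standing in for the actual benchmark files.
en = StringIO("Good morning.\nThe chip is tiny.\n")
it = StringIO("Buongiorno.\nIl chip è minuscolo.\n")
pairs = read_parallel(en, it)
```

Each entry of `pairs` is then one (source, translation) tuple, matching the "every entry is a sentence plus its translation" description above.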
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Prompts</head><p>Table <ref type="table">1</ref> reports the simple prompt formats we propose. Both start with a short translation instruction, followed by the source sentence, and then the target-language translation on a new line. We include four iterations of this format in the actual prompts before appending the input, so as to activate the LLMs' in-context learning ability <ref type="bibr" target="#b0">[1]</ref>.</p><p>Both the source and the translation are enclosed in the characters &lt; and &gt;. This instructs the model to reproduce the same format in its output. We do so to address LLMs' tendency to include unwanted extra comments in their outputs. Such comments would compromise all automatic evaluations (see Section 4): the extra content in the candidate outputs is penalized by the string-based metrics and alters the vector representations used by the model-based metrics to compute similarity scores.</p></div>
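The few-shot prompt construction and the angle-bracket post-processing described in Section 3.2 can be sketched as follows. The function names are ours, and the exact instruction wording is the one shown in the Table 1 examples:

```python
import re

INSTRUCTION = {
    "en-it": "Translate the following sentence into Italian:",
    "it-en": "Translate the following sentence into English:",
}

def build_prompt(direction, shots, source):
    """shots: list of (src, tgt) exemplar pairs; source: sentence to translate.
    Each exemplar shows the instruction, the <src>, and the <tgt> on a new line."""
    blocks = [f"{INSTRUCTION[direction]} <{s}>\n<{t}>" for s, t in shots]
    blocks.append(f"{INSTRUCTION[direction]} <{source}>\n")
    return "\n\n".join(blocks)

def extract_translation(output):
    """Keep only the text between the first < and > pair, dropping any
    extra commentary the model may have produced around it."""
    m = re.search(r"<(.*?)>", output, re.DOTALL)
    return m.group(1).strip() if m else output.strip()

prompt = build_prompt("en-it", [("Hello.", "Ciao.")], "Good morning.")
cleaned = extract_translation("Sure! <Buongiorno.> Hope that helps.")
```

In the actual experiments four exemplar pairs from the OPEN dev set would be passed as `shots`; the extraction step is what shields the string-based and model-based metrics from spurious model chatter.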
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Detailed data statistics</head><p>Table <ref type="table">2</ref> provides detailed statistics on the various sections of the benchmark in terms of number of segments (#seg) and of English (|en|) and Italian (|it|) words.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>We evaluate LLMs' translation performance using four automatic metrics, selected in light of the ongoing challenges in MT evaluation, which remains an open problem. New metrics are continually proposed, and evaluation campaigns aimed at assessing them are organised periodically (for example, the annual WMT Metrics Shared Task <ref type="bibr" target="#b8">[9]</ref>). Broadly, automatic metrics can be divided into string-based metrics and metrics based on pretrained models, each group with its own strengths and weaknesses <ref type="bibr" target="#b9">[10]</ref>. Therefore, for a more comprehensive evaluation of translation quality that accounts for their complementarity, we adopt two metrics from each group, selected among the most commonly used:</p><p>• string-based: BLEU 8 <ref type="bibr" target="#b10">[11]</ref> and CHRF 9 <ref type="bibr" target="#b11">[12]</ref> via sacreBLEU <ref type="bibr" target="#b12">[13]</ref> • pretrained model-based: BLEURT <ref type="bibr" target="#b13">[14]</ref> (checkpoint: BLEURT-20) and COMET <ref type="bibr" target="#b14">[15]</ref> (model: wmt22-comet-da).</p><p>All of them are quality metrics, i.e. the higher the score, the better the translation. The overview of the scores from all these metrics allows for a robust assessment of the quality of individual models, as well as a fair comparison between different models.</p><p>We provide reference performance on our challenge of one of the most popular open LLMs and four state-of-the-art MT models: </p></div>
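For intuition about the string-based family, here is a deliberately simplified sentence-level character n-gram F-score in the spirit of CHRF. This is a toy illustration only; the actual evaluation relies on sacreBLEU with the signatures reported in the footnotes:

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF by default ignores whitespace inside n-grams.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average character n-gram precision
    and recall over orders 1..max_n, combined into an F-beta score (0-100)."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # string too short for this order
        overlap = sum((hyp & ref).values())
        precs.append(overlap / sum(hyp.values()))
        recs.append(overlap / sum(ref.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    # beta > 1 weighs recall more than precision, as in chrF (beta = 2).
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 100, disjoint strings score 0, and a paraphrase lands in between, matching the "higher is better" convention stated above.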
<div xmlns="http://www.tei-c.org/ns/1.0"><head>en-it</head><p>Translate the following sentence into Italian: &lt;On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.&gt; &lt;Nella giornata di lunedì, alcuni scienziati della Scuola di Medicina dell'Università di Stanford hanno annunciato l'invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l'uno.&gt;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>it-en</head><p>Translate the following sentence into English: &lt;Nella giornata di lunedì, alcuni scienziati della Scuola di Medicina dell'Università di Stanford hanno annunciato l'invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l'uno.&gt; &lt;On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.&gt;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Examples of the prompt formats proposed for the MT Challenge. Prompt en-it is designed for translation from English into Italian, prompt it-en for the opposite direction. In both cases, only a single shot, taken from the OPEN dev set, is shown for instructing Llama3-8B-Instruct, while in the experiments of Section 4 four shots are provided to the model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Statistics of the benchmark in terms of number of segments and of (detokenized) words on the English and Italian sides.</p><p>Llama-3-8B-Instruct: 10 an LLM from the Llama 3 model family <ref type="bibr" target="#b4">[5]</ref>. It is an instruction-tuned model, i.e. it is fine-tuned to align its outputs with the desired response characteristics <ref type="bibr" target="#b15">[16]</ref>, in this case for assistant-like chat. Therefore, we provide the 4-shot prompts described in Section 3.2 as input to the model in a chat format, with user-role messages containing the instruction and the input, and assistant-role messages containing the corresponding output. 11</p><p>HelsinkiMT: 12 the Language Technology Research Group at the University of Helsinki made available, under the CC-BY-4.0 license, a set of neural MT models trained with MarianNMT 13 on OPUS data, 14 including English-Italian 15 and Italian-English 16 models. mBART50: 17 a multilingual neural translation model that covers any pair from a set of 50 languages, English and Italian included <ref type="bibr" target="#b16">[17]</ref>. Built by Meta/Facebook on the fairseq toolkit, 18 it is released under the MIT license. Its network has approximately 600M parameters. NLLB: 19 No Language Left Behind (NLLB) is also a multilingual neural translation model, covering any pair from more than 200 languages, including the two we are interested in. The code was developed by Meta/Facebook as a branch of fairseq and is released under the MIT license. Five different NLLB models are available under the CC-BY-NC 4.0 license; they differ mainly in size, ranging from the smallest with 600M parameters to the largest with 54.5B parameters. On the basis of their manageability and the official performance claimed by the authors, we decided to include two NLLB models in this investigation: the distilled variant with 1.3B parameters (NLLB_1.3B) and the one with 3.3B parameters (NLLB_3.3B).</p><p>Table <ref type="table">3</ref> provides the scores measured for each model on all evaluation sets of the benchmark, except for the OPEN dev set, which we reserved as the source of the exemplars used for few-shot prompting with Llama-3-8B-Instruct. First of all, we note that the performance of the three multilingual translation models mBART50, NLLB_1.3B, and NLLB_3.3B increases strictly with their number of parameters, on all metrics (with only one microscopic exception). In general, Llama-3-8B-Instruct performs better than mBART50 and worse than NLLB_1.3B.</p><p>The behavior of HelsinkiMT is harder to frame: in some cases it is clearly the best performing model (CLOSED-IT, it→en) or at least competitive with NLLB_3.3B (CLOSED-UK, en→it; CLOSED-IT, en→it); in others it is only slightly better than mBART50 (OPEN devtest, it→en; CLOSED-US, it→en). This can probably be explained by the fact that HelsinkiMT is not a single model, but rather a collection of models, each specifically trained to cover translation between two specific languages. That is, the HelsinkiMT en→it and it→en models were trained independently, on different training data. Therefore, their performance relative to that of the other models may not be consistent across the various sections of our benchmark.</p><p>In summary, we can state that Llama-3-8B-Instruct, a general-purpose generative model only conditioned towards performing translation by four task exemplars, compares well to dedicated translation models; fine-tuning it on the translation task would likely yield even better performance. However, it should be considered that this version of Llama-3-8B-Instruct, which is also the smallest of its model family, has 8B parameters: more than twice as many as NLLB_3.3B and an order of magnitude more than mBART50. </p></div>
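The chat-format few-shot prompting used for Llama-3-8B-Instruct, with alternating user and assistant turns, can be sketched as follows. The function name and message layout are a plausible reconstruction, not the authors' exact template:

```python
def to_chat(shots, source, instruction):
    """Turn few-shot exemplars into alternating user/assistant messages,
    ending with a final user turn holding the sentence to translate."""
    messages = []
    for src, tgt in shots:
        messages.append({"role": "user", "content": f"{instruction} <{src}>"})
        messages.append({"role": "assistant", "content": f"<{tgt}>"})
    messages.append({"role": "user", "content": f"{instruction} <{source}>"})
    return messages

msgs = to_chat([("Hello.", "Ciao."), ("Thanks.", "Grazie.")],
               "Good morning.",
               "Translate the following sentence into Italian:")
```

A list of this shape is what chat-tuned models typically consume, with each exemplar's translation placed in an assistant turn so that the model continues the pattern on the final user turn.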
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Translation results of MT models and LLMs on the benchmark. The best scores for each translation direction, subset, and metric are marked in bold.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>Nowadays, LLMs are trained on huge amounts of data mostly crawled from the web. Therefore, as already pointed out in Section 3, it is hard to be sure that there is no data contamination, that is, no overlap between training and evaluation data. Data contamination makes the evaluation of LLMs unreliable, since their performance may be inflated. In our specific case, the risk that the OPEN/FLORES+ data are contaminated is not negligible; however, the results shown in Table <ref type="table">3</ref>, which are good but realistic, do not seem to indicate any contamination.</p><p>In theory, the contamination risk of the CLOSED section is lower than that of the OPEN one, since the translations of the original texts have never been released. On the other hand, the original texts are available on the web (although only for private use), so it cannot be ruled out that the models "know" them in some way. For example, the exceptionally high results of HelsinkiMT on the CLOSED-IT set seem to be an anomaly, likely due to data contamination.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>Our proposal does not focus on ethically charged topics. While the data we propose for the evaluation of automatic translation may mention sensitive topics or be afflicted by ethical issues such as social biases (e.g., gender bias), here we focus solely on MT quality evaluation and leave the investigation of ethical aspects to other resources and analyses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>The OPEN section of our benchmark is part of the FLORES+ dataset, which is licensed under the Creative Commons Attribution Share Alike 4.0 International license, 20 requiring derivatives to be distributed under the same or a similar, compatible license. We opted for the same license.</p><p>There is no license associated with the CLOSED part of our benchmark, as it is not distributed and can only be used by CALAMITA Organizers for evaluation purposes.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>8 sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0. 9 sacreBLEU signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://en.wikipedia.org/wiki/DeepL_Translator</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">https://github.com/openlanguagedata/flores</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">https://huggingface.co/datasets/FBK-MT/ MAGNETbenchmark4CALAMITA24</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The work presented in this paper is funded by the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People) and the PNRR project FAIR -Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>https://mt.fbk.eu/author/cettolo/ (M. Cettolo); https://mt.fbk.eu/author/apiergentili/ (A. Piergentili); https://mt.fbk.eu/author/spapi/ (S. Papi); https://mt.fbk.eu/author/mgaido/ (M. Gaido); https://mt.fbk.eu/author/negri/ (M. Negri); https://mt.fbk.eu/author/bentivogli/ (L. Bentivogli). 3 https://www2.statmt.org/wmt24/translation-task.html 4 https://iwslt.org/2024/#shared-tasks 5 https://www2.statmt.org/wmt24/metrics-task.html</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><surname>Openai</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<title level="m">GPT-4</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">technical report</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2302.13971" />
		<title level="m">Llama: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Llama 2: Open foundation and finetuned chat models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2307.09288" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">The Llama 3 herd of models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dubey</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2407.21783</idno>
		<ptr target="https://arxiv.org/abs/2407.21783" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4-6, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><surname>NLLB Team</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Costa-Jussà</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Çelebi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elbayad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Heafield</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Heffernan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kalbassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Licht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maillard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Youngblood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Akula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Barrault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mejia-Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hansanti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jarrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Sadagopan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spruit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Andrews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Ayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mourachko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ropers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saleem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2207.04672</idno>
		<title level="m">No language left behind: Scaling human-centered machine translation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Investigating data contamination in modern benchmarks for large language models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gerstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.naacl-long.482" />
	</analytic>
	<monogr>
		<title level="m">Proc. of NAACL</title>
		<title level="s">Long Papers</title>
		<meeting>of NAACL<address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="8706" to="8719" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent</title>
		<author>
			<persName><forename type="first">M</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mathur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Avramidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thompson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kocmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Blain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Deutsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stewart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Castilho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Foster</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.wmt-1.51" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="578" to="628" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">To ship or not to ship: An extensive evaluation of automatic metrics for machine translation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kocmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Federmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Grundkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Junczys-Dowmunt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Matsushita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Menezes</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.wmt-1.57" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT, Online</title>
				<meeting>of WMT, Online</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="478" to="494" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">BLEU: a Method for Automatic Evaluation of Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ACL</title>
				<meeting>of ACL<address><addrLine>Philadelphia, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">chrF: character n-gram F-score for automatic MT evaluation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W15-3049" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="392" to="395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A Call for Clarity in Reporting BLEU Scores</title>
		<author>
			<persName><forename type="first">M</forename><surname>Post</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W18-6319" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Belgium, Brussels</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="186" to="191" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">BLEURT: Learning robust metrics for text generation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Parikh</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.acl-main.704" />
	</analytic>
	<monogr>
		<title level="m">Proc. of ACL, Online</title>
				<meeting>of ACL, Online</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7881" to="7892" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">COMET-22: Unbabel-IST 2022 submission for the metrics shared task</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G C</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Farinha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Glushkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Coheur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.wmt-1.52" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Abu Dhabi, United Arab Emirates (Hybrid)</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="578" to="585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Instruction tuning for large language models: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.10792</idno>
		<ptr target="https://arxiv.org/abs/2308.10792" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Multilingual translation with extensible multilingual pretraining and finetuning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2008.00401</idno>
		<ptr target="https://arxiv.org/abs/2008.00401" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
