MAGNET - MAchines GeNErating Translations:
                         A CALAMITA Challenge
                         Mauro Cettolo1,*,† , Andrea Piergentili1,2,† , Sara Papi1 , Marco Gaido1 , Matteo Negri1 and
                         Luisa Bentivogli1
                         1
                             Fondazione Bruno Kessler, Trento, Italy
                         2
                             University of Trento, Italy


                                            Abstract
                                            We propose MAGNET - MAchines GeNErating Translations, a CALAMITA Challenge which aims at testing the ability of large language
                                            models (LLMs) in the hot topic of automatic translation, focusing on Italian and English (in both directions) to overcome the marginality
                                            with which Italian is considered by the machine translation community. We propose a benchmark composed of two portions with
                                            different distribution policies (one free to use, the other not discloseable), allowing to handle data contamination issues. The publicly
                                            available section of the benchmark is distributed on Hugging Face, whereas in this report we describe the details of our challenge,
                                            including the prompt formats to be used. Additionally, we report the performance of five models, including a LLM and different sized
                                            translation models, in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically
                                            generated translations.

                                            Keywords
                                            Machine translation, English-Italian, FLORES+, Bleu, ChrF, Bleurt, Comet, Llama3-8B-Instruct, mBART50, NLLB


                         1. Introduction and Motivation                                                                                world, including Italian. On the other hand, the global MT
                                                                                                                                       market size was valued at USD 847.24 million in 2021 and is
                         Machine Translation (MT) refers to the process, carried out                                                   expected to expand at a compound annual growth rate of
                         by a computer program, of translating text from one lan-                                                      16.4% in 2024-2031, reaching USD 2107.56 million by 2027.2
                         guage to another without human involvement. The idea of                                                       Being Europe, and then Italy, one of the leading regions for
                         using digital computers to translate natural languages dates                                                  the MT market, CALAMITA [6] cannot miss MT. Therefore
                         back to the 1940s, making MT one of the oldest fields of artifi-                                              we propose the challenge of testing the LLMs ability in the
                         cial intelligence. Since then, the improvement in translation                                                 hot topic of automatic translation, focusing on Italian and
                         quality has been constant and achieved through increasingly                                                   English (in both directions) to overcome the marginality
                         effective approaches (rule-, example- and statistical-based);                                                 with which Italian is considered by the MT community.
                         however, the most significant advances have likely been
                         observed over the last few years, thanks to the introduction
                         of neural networks. Neural models specifically trained for                                                    2. Challenge: Description
                         accomplishing the translation task, like DeepL Translator,1
                         reach outstanding quality, even if the so-called human par-                                                   The MAGNET challenge provides a framework for assessing
                         ity has not been achieved yet, especially in unrestricted                                                     the ability of LLMs in translating Italian text into English and
                         domains and for language pairs not involving English. Re-                                                     vice-versa. It is organized following the blueprint of other
                         cently, an alternative neural-based method is gathering a                                                     long-standing MT shared tasks, such as those proposed
                         lot of interest due to its undoubted potential; it consists in                                                in the WMT3 and IWSLT4 conferences, where Organizers
                         prompting generative large language models (LLMs), like                                                       prepare and distribute development and test sets, define the
                         GPT models [1, 2] and the LLama model family [3, 4, 5], to                                                    training conditions, possibly providing specific training data,
                         translate a text. Whatever the approach, the MT research                                                      establish the evaluation modalities, typically via automatic
                         community is much focused on the development and vali-                                                        metrics and occasionally enriched by human evaluations,
                         dation of models covering English and few other languages,                                                    collect and evaluate participants’ submissions, and finally
                         paying little attention or completely neglecting the vast                                                     disclose the results.
                         majority of the more than 7,000 languages spoken in the                                                          The MAGNET challenge supplies a benchmark divided in
                                                                                                                                       two portions: one based on a publicly available MT bench-
                         CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec                                      mark and a private one (see Section 3). This allows par-
                         04 — 06, 2024, Pisa, Italy
                         *
                           Corresponding author.
                                                                                                                                       ticipants not only to evaluate their models but possibly to
                         †
                           These authors contributed equally.
                                                                                                                                       also fine-tune them, by exploiting the open portion of the
                         $ cettolo@fbk.eu (M. Cettolo); apiergentili@fbk.eu (A. Piergentili);                                          MAGNET benchmark for development purposes.
                         spapi@fbk.eu (S. Papi); mgaido@fbk.eu (M. Gaido); negri@fbk.eu                                                   Multiple evaluation metrics are employed so as to have a
                         (M. Negri); bentivo@fbk.eu (L. Bentivogli)                                                                    comprehensive overview of the quality of the translations
                          https://mt.fbk.eu/author/cettolo/ (M. Cettolo);                                                             generated by a specific model. Indeed, shared tasks on au-
                         https://mt.fbk.eu/author/apiergentili/ (A. Piergentili);
                                                                                                                                       tomatic metrics are still being organized,5 as evidence of
                         https://mt.fbk.eu/author/spapi/ (S. Papi);
                         https://mt.fbk.eu/author/mgaido/ (M. Gaido);                                                                  the fact that none of the metrics designed up to now by the
                         https://mt.fbk.eu/author/negri/ (M. Negri);                                                                   scientific community has proven capable of covering every
                         https://mt.fbk.eu/author/bentivogli/ (L. Bentivogli)                                                          single aspect that defines a “good” translation by itself .
                          0000-0001-8388-497X (M. Cettolo); 0000-0002-4494-8886
                         (A. Piergentili); 0000-0002-4494-8886 (S. Papi); 0000-0003-4217-1396                                          2
                                                                                                                                         https://www.linkedin.com/pulse/machine-translation-mt-market- size-
                         (M. Gaido); 0000-0002-8811-4330 (M. Negri); 0000-0001-7480-2231                                                 2024-suhoe/
                         (L. Bentivogli)                                                                                               3
                                                                                                                                         https://www2.statmt.org/wmt24/translation-task.html
                                        © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License   4
                                        Attribution 4.0 International (CC BY 4.0).                                                       https://iwslt.org/2024/#shared-tasks
                         1                                                                                                             5
                             https://en.wikipedia.org/wiki/DeepL_Translator                                                              https://www2.statmt.org/wmt24/metrics-task.html


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
   In addition, in order to allow for comparisons, scores mea-   to the CLOSED portion is only provided to the Organizers
sured on the translation generated by Llama3-8B-Instruct         of the task.
and a number of other models are made available (see Sec-
tion 4).                                                         3.2. Prompts
                                                                 Table 1 reports the simple prompt formats we propose. Both
3. Data description                                              contain a simple translation instruction first, followed by the
                                                                 source sentence, and then the target language translation in
We test LLMs’ ability to translate between Italian and En-       a new line. We include four iterations of this format in the
glish using a parallel corpus composed of two parts: an          actual prompts before appending the input, so as to activate
OPEN portion and a CLOSED one.
                                                                 LLMs’ in-context learning ability [1].
                                                                    Both the source and the translation are surrounded by
OPEN For the OPEN portion of the MAGNET benchmark                the characters < and >. This instructs the model to repro-
we propose FLORES+, the latest version of FLORES-2006 [7],       duce this format in its output as well. We do so to address
a multilingual MT evaluation benchmark released under CC         LLMs’ tendency to include unwanted extra comments in
BY-SA 4.0 by FAIR researchers at Meta. It consists of English    their outputs. Such comments would compromise all au-
sentences sampled in equal amounts from Wikinews (an             tomatic evaluations (see Section 4) due to the presence of
international news source), Wikijunior (a collection of age-     extra content in the candidate outputs, which is penalized
appropriate non-fiction books), and Wikivoyage (a travel         by the string-based metrics and alters the vector representa-
guide), translated into more than 200 languages, including       tions used by the model-based metrics to compute similarity
Italian. Dev and devtest sets consisting of about 1,000 seg-     scores.
ments each are provided. See Section 3.3 for statistics on
this portion of the MAGNET benchmark.
                                                                 3.3. Detailed data statistics
CLOSED The CLOSED subset is a MT test set developed              In Table 2 detailed statistics are provided on the various
by FBK by collecting texts of English and Italian news, and      sections of the benchmark in terms of number of segments
then commissioning their professional translation to a spe-      (#seg), and of English (|en|) and Italian (|it|) words.
cialized company. This resource is private and not publicly
accessible. See Section 3.3 for statistics on this portion of
the MAGNET benchmark.                                            4. Metrics
                                                                 We evaluate LLMs’ performance in translation using a set
   Both subsets allow for the evaluation of MT quality in        of four automatic metrics selected in light of the ongoing
both translation directions, i.e. English→Italian and Ital-      challenges in MT evaluation, which still pose an open prob-
ian→English. The decision to split our benchmark in two          lem. New metrics are indeed continually proposed, and
subsets is primarily motivated by their current distribution     evaluation campaigns aimed at assessing these metrics are
policy, which is inherently linked to growing concerns about     organised periodically (for example, the annual WMT Met-
data contamination [8]. Data contamination refers to the         rics Shared Task [9]). Broadly, automatic metrics can be
possibility that the input-output pairs used in LLM tests        divided into string-based metrics and metrics using pre-
occur in the huge data sets typically used for pre-training      trained models, with either group having both strengths
and fine-tuning; such overlap can lead to inflated bench-        and weaknesses [10]. Therefore, for a more comprehensive
mark scores, creating an overly favorable impression of an       translation quality evaluation accounting for their comple-
LLM’s abilities. Although it is challenging to determine with    mentarity, we propose to adopt a couple of metrics from
certainty whether the models being evaluated were trained        each group, selected among the most commonly used ones:
on popular datasets scraped from the web, this possibility
should be taken seriously. To promote sound evaluation                   • string-based: BLEU8 [11] and CHRF9 [12] via
and mitigate the effects of biased or potentially mislead-                 sacreBLEU [13]
ing results due to data contamination, one approach is to                • pretrained models-based: BLEURT [14] (check-
rely exclusively on – or at least include among the bench-                 point: BLEURT-20) and COMET [15] (model:
marks – “safe” datasets that are either private or have very               wmt22-comet-da).
controlled/limited distribution. Therefore, pairing a larger,
widely used public dataset (FLORES+) with a smaller, in-            All of them are quality metrics, that is the higher the
house dataset – the CLOSED subset – aims to strike a balance     score the better the translation. The overview of the scores
between the thoroughness and the reliability of the evalua-      from all these metrics allows for a robust assessment of the
tion.                                                            quality of individual models, and a fair comparison between
                                                                 different models as well.
3.1. Data format                                                    We provide reference performance on our challenge of
                                                                 one of the most popular open LLMs, and four state-of-the-
The datasets are organized in a parallel text format, i.e. ev-   art MT models:
ery entry is composed of a sentence in one language and the
corresponding translation. The OPEN portion of the bench-
mark is publicly available on Hugging Face,7 whereas access
                                                                 8
                                                                     sacreBLEU signature: nrefs:1|case:mixed|
6
  https://github.com/openlanguagedata/flores                         |eff:no|tok:13a|smooth:exp|version:2.0.0
7                                                                9
  https://huggingface.co/datasets/FBK-MT/                            sacreBLEU signature: nrefs:1|case:mixed|
  MAGNETbenchmark4CALAMITA24                                         |eff:yes|nc:6|nw:0|space:no|version:2.0.0
       prompt      content
                   Translate the following sentence into Italian: <On Monday, scientists from the Stanford University School of
                   Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that
                   can be manufactured using standard inkjet printers for possibly about one U.S. cent each.>
       en-it       <Nella giornata di lunedí, alcuni scienziati della Scuola di Medicina dell’Università di Stanford hanno annunciato
                   l’invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo
                   che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro
                   l’uno.>
                   Translate the following sentence into English: <Nella giornata di lunedí, alcuni scienziati della Scuola di Medicina
                   dell’Università di Stanford hanno annunciato l’invenzione di un nuovo strumento diagnostico capace di ordinare
                   le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro
       it-en       al costo di circa 1 centesimo di dollaro l’uno.>
                   <On Monday, scientists from the Stanford University School of Medicine announced the invention of a new
                   diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet
                   printers for possibly about one U.S. cent each.>

     Table 1
     Examples of the format of prompts proposed for MT Challenge. Prompt en-it is designed for the translation from English into
     Italian, prompt it-en for the opposite direction. In both cases, for instructing Llama3-8B-Instruct only one single shot taken
     from the OPEN dev set is shown, while in experiments of Section 4 four shots are provided to the model.


          Data         Set       #seg      |en|     |it|                  different NLLB models are available under the CC-BY-NC
                       dev        997     21.0k    23.0k
          OPEN                                                            4.0 license, which mainly differ in size, ranging from the
                       devtst    1012     21.9k    24.3k
                                                                          smallest with 600M parameters to the largest with 54.5B
                       UK         589     10.6k    11.2k
          CLOSED       US         599     10.0k    9.7k
                                                                          parameters. On the basis of their manageability and official
                       IT         547     10.8k    10.3k                  performance claimed by the authors, we decided to include
                                                                          two NLLB models in this investigation, the distilled variant
Table 2                                                                   with 1.3B parameters (NLLB_1.3B) and the one with 3.3B
Statistics of the benchmark in terms of number of segments and            parameters (NLLB_3.3B).
of (detokenized) words on English and Italian sides.
                                                                             Table 3 provides the scores measured for each model on
                                                                          all evaluation sets of the benchmark, except for the OPEN
                                                                          dev set, since we reserved that subset as the source of the
   Llama-3-8B-Instruct:10 a LLM from the Llama 3 model
                                                                          exemplars used for few-shot prompting with Llama-3-8B-
family [5]. It is an instruction-tuned model, i.e. it is fine-
                                                                          Instruct. First of all, we note that the performance of the
tuned to align its outputs with the desired response charac-
                                                                          three multilingual translation models mBART50, NLLB_1.3B
teristics [16], in this case for assistant-like chat. Therefore,
                                                                          and NLLB_3.3B are strictly in increasing order according
we provide the 4-shot prompts described in Section 3.2 as
                                                                          to their number of parameters, with respect to all metrics
input for the model in a chat format, with user role mes-
                                                                          (with only one microscopic exception). In general, Llama-3-
sages with the instruction and the input and assistant role
                                                                          8B-Instruct performs better than mBART50 and worse than
messages with the corresponding output.11
                                                                          NLLB_1.3B.
  HelsinkiMT:12 the Language Technology Research                             The behavior of HelsinkiMT is more difficult to frame:
Group at the University of Helsinki made available under                  there are cases in which it is definitely the best perform-
the CC-BY-4.0 license a set of neural MT models trained with              ing model (CLOSED-IT, it→en) or at least competitive with
MarianNMT13 on OPUS data,14 including English-Italian15                   NLLB_3.3B (CLOSED-UK, en→it; CLOSED-IT, en→it); oth-
and Italian-English16 models.                                             ers in which it is only slightly better than mBART50 (OPEN
   mBART50:17 a multilingual neural translation model                     devtst, it→en; CLOSED-US, it→en). This can probably be
that covers any pair from a set of 50 languages, English and              explained by the fact that HelsinkiMT is not a single model,
Italian included [17]. Built by Meta/Facebook on the fairseq              rather a collection of models specifically trained for cov-
toolkit,18 it is released under the MIT license. Its network              ering the translation between specific languages. That is,
has approximately 600M parameters.                                        HelsinkiMT en→it and it→en models were trained indepen-
    NLLB:19 No Language Left Behind (NLLB) is also a mul-                 dently, on different training data. Therefore, it is possible
tilingual neural translation model that covers any pair from              that their performance when compared to that of other mod-
more than 200 languages, including the two we are inter-                  els may not be consistent across the various sections of our
ested in. The code was developed by Meta/Facebook as a                    benchmark.
branch of fairseq and is released under the MIT license. Five                In summary, we can state that Llama-3-8B-Instruct, a
                                                                          general purpose, generative model only conditioned towards
10
   https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct             performing translation by four task exemplars, compares
11
   https://huggingface.co/docs/transformers/main/en/chat_templating       well to translation models; likely, fine-tuning Llama-3-8B-
12
   https://github.com/Helsinki-NLP/Opus-MT                                Instruct on the translation task could allow it to achieve
13
   https://marian-nmt.github.io/                                          even better performance. However, it should be considered
14
   https://opus.nlpl.eu/
15
   https://huggingface.co/Helsinki-NLP/opus-mt-en-it
                                                                          that this version of Llama-3-8B-Instruct – which is also the
16
   https://huggingface.co/Helsinki-NLP/opus-mt-it-en                      smallest of that model family – has 8B parameters, more
17
   https://huggingface.co/facebook/mbart-large-50                         than twice the parameters of NLLB_3.3B and an order of
18
   https://github.com/facebookresearch/fairseq                            magnitude more than mBART50.
19
   https://github.com/facebookresearch/fairseq/tree/nllb
                                                        it→en                                        en→it
           system
                                      BLEU      ChrF      BLEURT    COMET          BLEU      ChrF      BLEURT       COMET
                                                                       OPEN – devtst
           HelsinkiMT                 29.39     60.00     0.7568     0.8656    27.53         57.61      0.7422       0.8521
           mBART50                    27.34     57.64     0.7371     0.8494    23.88         54.34      0.7322       0.8502
           NLLB_1.3B                  35.08     62.42     0.7732     0.8774    29.31         58.04      0.7773       0.8749
           NLLB_3.3B                  35.03     63.04     0.7781     0.8805    29.95         58.74      0.7871       0.8811
           Llama-3-8B-Instruct        32.04     62.03     0.7778     0.8795    26.36         56.60      0.7710       0.8758
                                                                       CLOSED – UK
           HelsinkiMT                 48.06     71.78     0.8038     0.8949   57.35          76.99      0.7998       0.8836
           mBART50                    43.77     68.79     0.7789     0.8776   47.46          70.68      0.7910       0.8837
           NLLB_1.3B                  52.48     73.83     0.8072     0.8954   55.12          74.62      0.8160       0.8933
           NLLB_3.3B                  54.61     75.09     0.8096     0.8968   56.00          75.28      0.8210       0.8937
           Llama-3-8B-Instruct        46.61     71.02     0.8088     0.8985   39.29          66.50      0.7948       0.8840
                                                                       CLOSED – US
           HelsinkiMT                 39.26     62.25     0.7459     0.8571   39.02          64.41      0.7395       0.8394
           mBART50                    37.54     60.78     0.7314     0.8437   34.19          60.79      0.7309       0.8420
           NLLB_1.3B                  42.72     64.76     0.7449     0.8544   39.91          64.40      0.7580       0.8566
           NLLB_3.3B                  43.36     65.23     0.7483     0.8585  40.35           64.63      0.7681       0.8583
           Llama-3-8B-Instruct        39.08     62.53     0.7502     0.8613   28.73          58.24      0.7355       0.8469
                                                                       CLOSED – IT
           HelsinkiMT                 59.14     77.83     0.7814     0.8515  48.90           74.47      0.8278       0.8898
           mBART50                    39.00     63.98     0.7101     0.8029  37.24           66.65      0.7858       0.8679
           NLLB_1.3B                  49.17     69.88     0.7361     0.8251  46.48           72.32      0.8212       0.8896
           NLLB_3.3B                  50.33     70.67     0.7373     0.8271  47.67           73.56      0.8285       0.8928
           Llama-3-8B-Instruct        43.89     68.96     0.7660     0.8496  37.19           67.64      0.7996       0.8797
     Table 3
     Translation results on benchmark of MT models and LLMs. The best scores for each translation direction, subset, and metric
     are signalled in bold.


5. Limitations                                                       7. Data license and copyright issues
Nowadays, LLMs are trained on huge amounts of data                   The OPEN section of our benchmark is part of the FLO-
mostly crawled from the web. Therefore, as already pointed           RES+ dataset which is licensed under the Creative Commons
out in Section 3, it is hard to be sure that there is no data        Attribution Share Alike 4.0 International,20 which requires
contamination, that is no overlap between training and eval-         derivatives to be distributed under the same or a similar,
uation data. Data contamination makes the evaluation of              compatible license. We opted for the same license.
LLMs unreliable since their performance may be inflated.               There is no license associated with the CLOSED part of
   Concerning our specific case, the risk that OPEN/FLORES+          our benchmark as it is not distributed and can only be used
data are contaminated is not negligible; however the results         by CALAMITA Organizers for evaluation purposes.
shown in Table 3, which are good but realistic, do not seem
to indicate any contamination.
   In theory, the contamination risk of the CLOSED section is        Acknowledgments
lower than for the CLOSED one, since the translations of the
                                                                     The work presented in this paper is funded by the Euro-
original texts have never been released. On the other hand,
                                                                     pean Union’s Horizon research and innovation programme
original texts are available on the web (although only for
                                                                     under grant agreement No 101135798, project Meetween
private use), therefore it cannot be ruled out that the models
                                                                     (My Personal AI Mediator for Virtual MEETtings BetWEEN
“know” them, in some way. For example, the exceptionally
                                                                     People) and the PNRR project FAIR - Future AI Research
high results of HelsinkiMT on the CLOSED-IT set seem to
                                                                     (PE00000013), under the NRRP MUR program funded by the
be an anomaly, likely due to data contamination.
                                                                     NextGenerationEU.

6. Ethical issues                                                    References
Our proposal does not focus on ethically charged topics.
                                                                      [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D.
While the data we propose for the evaluation of automatic
                                                                          Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
translation may mention sensitive topics or be afflicted by
                                                                          G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,
ethical issues such as social biases (e.g., gender bias), here we
                                                                          G. Krueger, T. Henighan, R. Child, A. Ramesh,
focus solely on MT quality evaluation and leave the investi-
                                                                          D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
gation of ethical aspects to other resources and analyses.
                                                                          E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
                                                                          C. Berner, S. McCandlish, A. Radford, I. Sutskever,
                                                                          D. Amodei, Language models are few-shot learners,
                                                                          in: Advances in Neural Information Processing
                                                                     20
                                                                          https://github.com/openlanguagedata/flores/blob/main/LICENSE
     Systems, volume 33, 2020, pp. 1877–1901. URL: https:              URL: https://www.aclweb.org/anthology/W18-6319.
     //proceedings.neurips.cc/paper_files/paper/2020/file/        [14] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust
     1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.                       metrics for text generation, in: Proc. of ACL, Online,
 [2] OpenAI, Gpt-4 technical report, 2024. URL: https://               2020, pp. 7881–7892. URL: https://aclanthology.org/
     arxiv.org/abs/2303.08774. arXiv:2303.08774.                       2020.acl-main.704.
 [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.        [15] R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C.
     Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro,             Farinha, T. Glushkova, A. Lavie, L. Coheur, A. F. T.
     F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lam-              Martins, COMET-22: Unbabel-IST 2022 submission
     ple, Llama: Open and efficient foundation language                for the metrics shared task, in: Proc. of WMT, Abu
     models, 2023. URL: https://arxiv.org/abs/2302.13971.              Dhabi, United Arab Emirates (Hybrid), 2022, pp. 578–
     arXiv:2302.13971.                                                 585. URL: https://aclanthology.org/2022.wmt-1.52.
 [4] H. Touvron, et al., Llama 2: Open foundation and fine-       [16] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li,
     tuned chat models, 2023. URL: https://arxiv.org/abs/              R. Hu, T. Zhang, F. Wu, G. Wang, Instruction tuning
     2307.09288. arXiv:2307.09288.                                     for large language models: A survey, 2024. URL: https:
 [5] A. Dubey, et al., The Llama 3 herd of mod-                        //arxiv.org/abs/2308.10792. arXiv:2308.10792.
     els, 2024. URL: https://arxiv.org/abs/2407.21783.            [17] Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal,
     arXiv:2407.21783.                                                 V. Chaudhary, J. Gu, A. Fan, Multilingual transla-
 [6] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis,        tion with extensible multilingual pretraining and fine-
     J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi,           tuning, 2020. URL: https://arxiv.org/abs/2008.00401.
     D. Scalena, CALAMITA: Challenge the Abilities of                  arXiv:2008.00401.
     LAnguage Models in ITAlian, in: Proceedings of the
     10th Italian Conference on Computational Linguistics
     (CLiC-it 2024), Pisa, Italy, December 4 - December 6,
     2024, CEUR Workshop Proceedings, CEUR-WS.org,
     2024.
 [7] NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi,
     M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi,
     J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wen-
     zek, A. Youngblood, B. Akula, L. Barrault, G. Mejia-
     Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R.
     Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews,
     N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao,
     V. Goswami, F. Guzmán, P. Koehn, A. Mourachko,
     C. Ropers, S. Saleem, H. Schwenk, J. Wang, No lan-
     guage left behind: Scaling human-centered machine
     translation, 2022. arXiv:arXiv:1902.01382.
 [8] C. Deng, Y. Zhao, X. Tang, M. Gerstein, A. Cohan,
     Investigating data contamination in modern bench-
     marks for large language models, in: Proc. of NAACL
     (Volume 1: Long Papers), Mexico City, Mexico, 2024,
     pp. 8706–8719. URL: https://aclanthology.org/2024.
     naacl-long.482.
 [9] M. Freitag, N. Mathur, C.-k. Lo, E. Avramidis, R. Rei,
     B. Thompson, T. Kocmi, F. Blain, D. Deutsch, C. Stew-
     art, C. Zerva, S. Castilho, A. Lavie, G. Foster, Re-
     sults of WMT23 metrics shared task: Metrics might
     be guilty but references are not innocent, in: Proc.
     of WMT, Singapore, 2023, pp. 578–628. URL: https:
     //aclanthology.org/2023.wmt-1.51.
[10] T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-
     Dowmunt, H. Matsushita, A. Menezes, To ship or not
     to ship: An extensive evaluation of automatic metrics
     for machine translation, in: Proc. of WMT, Online,
     2021, pp. 478–494. URL: https://aclanthology.org/2021.
     wmt-1.57.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a
     Method for Automatic Evaluation of Machine Trans-
     lation, in: Proc. of ACL, Philadelphia, USA, 2002, pp.
     311–318.
[12] M. Popovic, chrF: character n-gram F-score for au-
     tomatic MT evaluation, in: Proc. of WMT, Lisbon,
     Portugal, 2015, pp. 392–395. URL: https://aclanthology.
     org/W15-3049.
[13] M. Post, A Call for Clarity in Reporting BLEU Scores,
     in: Proc. of WMT, Belgium, Brussels, 2018, pp. 186–191.