<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mauro</forename><surname>Cettolo</surname></persName>
							<email>cettolo@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Piergentili</surname></persName>
							<email>apiergentili@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Trento</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sara</forename><surname>Papi</surname></persName>
							<email>spapi@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Gaido</surname></persName>
							<email>mgaido@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Negri</surname></persName>
							<email>negri@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luisa</forename><surname>Bentivogli</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BCB59BD7DA50847F59D83B8DD93C5877</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine translation, English-Italian, FLORES+, Bleu, ChrF, Bleurt, Comet, Llama3-8B-Instruct, mBART50, NLLB</term>
					<term>0000-0001-8388-497X (M. Cettolo)</term>
					<term>0000-0002-4494-8886 (A. Piergentili)</term>
					<term>0000-0002-4494-8886 (S. Papi)</term>
					<term>0000-0003-4217-1396 (M. Gaido)</term>
					<term>0000-0002-8811-4330 (M. Negri)</term>
					<term>0000-0001-7480-2231 (L. Bentivogli)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We propose MAGNET - MAchines GeNErating Translations, a CALAMITA Challenge that tests the ability of large language models (LLMs) on the hot topic of automatic translation, focusing on Italian and English (in both directions) to counter the marginal attention that Italian receives from the machine translation community. We propose a benchmark composed of two portions with different distribution policies (one free to use, the other not discloseable), which allows us to handle data contamination issues. The publicly available section of the benchmark is distributed on Hugging Face, whereas in this report we describe the details of our challenge, including the prompt formats to be used. Additionally, we report the performance of five models, including an LLM and translation models of different sizes, in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Motivation</head><p>Machine Translation (MT) refers to the process, carried out by a computer program, of translating text from one language to another without human involvement. The idea of using digital computers to translate natural languages dates back to the 1940s, making MT one of the oldest fields of artificial intelligence. Since then, translation quality has improved steadily through increasingly effective approaches (rule-based, example-based, and statistical); the most significant advances, however, have likely come in the last few years, thanks to the introduction of neural networks. Neural models specifically trained for the translation task, like DeepL Translator,<ref type="foot" target="#foot_0">1</ref> reach outstanding quality, even if so-called human parity has not yet been achieved, especially in unrestricted domains and for language pairs not involving English. Recently, an alternative neural-based method has been gathering considerable interest due to its undoubted potential: prompting generative large language models (LLMs), like the GPT models <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> and the Llama model family <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>, to translate a text. Whatever the approach, the MT research community focuses largely on developing and validating models covering English and a few other languages, paying little attention to, or completely neglecting, the vast majority of the more than 7,000 languages spoken in the world, including Italian. On the other hand, the global MT market was valued at USD 847.24 million in 2021 and is expected to expand at a compound annual growth rate of 16.4% in 2024-2031, reaching USD 2107.56 million by 2027.<ref type="foot">2</ref> Since Europe, and with it Italy, is one of the leading regions of the MT market, CALAMITA <ref type="bibr" target="#b5">[6]</ref> cannot overlook MT. We therefore propose the challenge of testing LLMs' ability on the hot topic of automatic translation, focusing on Italian and English (in both directions) to counter the marginal attention that Italian receives from the MT community.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>The MAGNET challenge provides a framework for assessing the ability of LLMs to translate Italian text into English and vice versa. It is organized following the blueprint of long-standing MT shared tasks, such as those proposed at the WMT 3 and IWSLT 4 conferences: the Organizers prepare and distribute development and test sets, define the training conditions (possibly providing specific training data), establish the evaluation modalities (typically automatic metrics, occasionally enriched by human evaluations), collect and evaluate participants' submissions, and finally disclose the results.</p><p>The MAGNET challenge supplies a benchmark divided into two portions: one based on a publicly available MT benchmark and a private one (see Section 3). This allows participants not only to evaluate their models but also, possibly, to fine-tune them, by exploiting the open portion of the MAGNET benchmark for development purposes.</p><p>Multiple evaluation metrics are employed so as to obtain a comprehensive overview of the quality of the translations generated by a given model. Indeed, shared tasks on automatic metrics are still being organized, 5 evidence that no metric designed so far by the scientific community has proven capable of covering, by itself, every aspect that defines a "good" translation.</p><p>In addition, to allow for comparisons, scores measured on the translations generated by Llama3-8B-Instruct and a number of other models are made available (see Section 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>We test LLMs' ability to translate between Italian and English using a parallel corpus composed of two parts: an OPEN portion and a CLOSED one.</p><p>OPEN For the OPEN portion of the MAGNET benchmark we propose FLORES+, the latest version of FLORES-200<ref type="foot" target="#foot_1">6</ref> <ref type="bibr" target="#b6">[7]</ref>, a multilingual MT evaluation benchmark released under CC BY-SA 4.0 by FAIR researchers at Meta. It consists of English sentences sampled in equal amounts from Wikinews (an international news source), Wikijunior (a collection of age-appropriate non-fiction books), and Wikivoyage (a travel guide), translated into more than 200 languages, including Italian. Dev and devtest sets, each consisting of about 1,000 segments, are provided. See Section 3.3 for statistics on this portion of the MAGNET benchmark.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CLOSED</head><p>The CLOSED subset is an MT test set developed by FBK by collecting English and Italian news texts and commissioning their professional translation to a specialized company. This resource is private and not publicly accessible. See Section 3.3 for statistics on this portion of the MAGNET benchmark.</p><p>Both subsets allow for the evaluation of MT quality in both translation directions, i.e. English→Italian and Italian→English. The decision to split our benchmark into two subsets is primarily motivated by their current distribution policies, which are inherently linked to growing concerns about data contamination <ref type="bibr" target="#b7">[8]</ref>. Data contamination refers to the possibility that the input-output pairs used in LLM tests occur in the huge datasets typically used for pre-training and fine-tuning; such overlap can lead to inflated benchmark scores, creating an overly favorable impression of an LLM's abilities. Although it is challenging to determine with certainty whether the models being evaluated were trained on popular datasets scraped from the web, this possibility should be taken seriously. To promote sound evaluation and mitigate the effects of biased or potentially misleading results due to data contamination, one approach is to rely exclusively on, or at least include among the benchmarks, "safe" datasets that are either private or have very controlled/limited distribution. Therefore, pairing a larger, widely used public dataset (FLORES+) with a smaller, in-house dataset (the CLOSED subset) aims to strike a balance between the thoroughness and the reliability of the evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data format</head><p>The datasets are organized in a parallel text format, i.e. every entry is composed of a sentence in one language and the corresponding translation. The OPEN portion of the benchmark is publicly available on Hugging Face, <ref type="foot" target="#foot_2">7</ref> whereas access to the CLOSED portion is only provided to the Organizers of the task.</p></div>
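As an illustration of this parallel text format, the following sketch pairs each source sentence with the target sentence at the same index. The helper name and the toy sentences are ours, not part of the benchmark, and the actual Hugging Face dataset layout may differ:

```python
from io import StringIO

def read_parallel(src_file, tgt_file):
    """Pair each source line with the target line at the same index.
    Raises if the two files are not aligned one sentence per line."""
    src_lines = [l.strip() for l in src_file if l.strip()]
    tgt_lines = [l.strip() for l in tgt_file if l.strip()]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files are not aligned")
    return list(zip(src_lines, tgt_lines))

# Toy aligned data standing in for the actual benchmark files.
en = StringIO("Good morning.\nThe chip is tiny.\n")
it = StringIO("Buongiorno.\nIl chip è minuscolo.\n")
pairs = read_parallel(en, it)
```

Each entry of `pairs` is then one (source, translation) tuple, matching the "every entry is a sentence plus its translation" description above.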
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Prompts</head><p>Table <ref type="table">1</ref> reports the simple prompt formats we propose. Both start with a short translation instruction, followed by the source sentence, and then the target-language translation on a new line. We include four iterations of this format in the actual prompts before appending the input, so as to activate the LLMs' in-context learning ability <ref type="bibr" target="#b0">[1]</ref>.</p><p>Both the source and the translation are enclosed in the characters &lt; and &gt;. This instructs the model to reproduce the same format in its output. We do so to address LLMs' tendency to include unwanted extra comments in their outputs. Such comments would compromise all automatic evaluations (see Section 4): the extra content in the candidate outputs is penalized by the string-based metrics and alters the vector representations used by the model-based metrics to compute similarity scores.</p></div>
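The few-shot prompt construction and the angle-bracket post-processing described in Section 3.2 can be sketched as follows. The function names are ours, and the exact instruction wording is the one shown in the Table 1 examples:

```python
import re

INSTRUCTION = {
    "en-it": "Translate the following sentence into Italian:",
    "it-en": "Translate the following sentence into English:",
}

def build_prompt(direction, shots, source):
    """shots: list of (src, tgt) exemplar pairs; source: sentence to translate.
    Each exemplar shows the instruction, the <src>, and the <tgt> on a new line."""
    blocks = [f"{INSTRUCTION[direction]} <{s}>\n<{t}>" for s, t in shots]
    blocks.append(f"{INSTRUCTION[direction]} <{source}>\n")
    return "\n\n".join(blocks)

def extract_translation(output):
    """Keep only the text between the first < and > pair, dropping any
    extra commentary the model may have produced around it."""
    m = re.search(r"<(.*?)>", output, re.DOTALL)
    return m.group(1).strip() if m else output.strip()

prompt = build_prompt("en-it", [("Hello.", "Ciao.")], "Good morning.")
cleaned = extract_translation("Sure! <Buongiorno.> Hope that helps.")
```

In the actual experiments four exemplar pairs from the OPEN dev set would be passed as `shots`; the extraction step is what shields the string-based and model-based metrics from spurious model chatter.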
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Detailed data statistics</head><p>Table <ref type="table">2</ref> provides detailed statistics on the various sections of the benchmark in terms of number of segments (#seg) and of English (|en|) and Italian (|it|) words.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>We evaluate LLMs' translation performance using four automatic metrics, selected in light of the ongoing challenges in MT evaluation, which remains an open problem. New metrics are continually proposed, and evaluation campaigns aimed at assessing them are organised periodically (for example, the annual WMT Metrics Shared Task <ref type="bibr" target="#b8">[9]</ref>). Broadly, automatic metrics can be divided into string-based metrics and metrics based on pretrained models, each group with its own strengths and weaknesses <ref type="bibr" target="#b9">[10]</ref>. Therefore, for a more comprehensive evaluation of translation quality that accounts for their complementarity, we adopt two metrics from each group, selected among the most commonly used:</p><p>• string-based: BLEU 8 <ref type="bibr" target="#b10">[11]</ref> and CHRF 9 <ref type="bibr" target="#b11">[12]</ref> via sacreBLEU <ref type="bibr" target="#b12">[13]</ref> • pretrained model-based: BLEURT <ref type="bibr" target="#b13">[14]</ref> (checkpoint: BLEURT-20) and COMET <ref type="bibr" target="#b14">[15]</ref> (model: wmt22-comet-da).</p><p>All of them are quality metrics, i.e. the higher the score, the better the translation. The overview of the scores from all these metrics allows for a robust assessment of the quality of individual models, as well as a fair comparison between different models.</p><p>We provide reference performance on our challenge of one of the most popular open LLMs and four state-of-the-art MT models: </p></div>
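For intuition about the string-based family, here is a deliberately simplified sentence-level character n-gram F-score in the spirit of CHRF. This is a toy illustration only; the actual evaluation relies on sacreBLEU with the signatures reported in the footnotes:

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF by default ignores whitespace inside n-grams.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average character n-gram precision
    and recall over orders 1..max_n, combined into an F-beta score (0-100)."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # string too short for this order
        overlap = sum((hyp & ref).values())
        precs.append(overlap / sum(hyp.values()))
        recs.append(overlap / sum(ref.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    # beta > 1 weighs recall more than precision, as in chrF (beta = 2).
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 100, disjoint strings score 0, and a paraphrase lands in between, matching the "higher is better" convention stated above.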
<div xmlns="http://www.tei-c.org/ns/1.0"><head>en-it</head><p>Translate the following sentence into Italian: &lt;On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.&gt; &lt;Nella giornata di lunedì, alcuni scienziati della Scuola di Medicina dell'Università di Stanford hanno annunciato l'invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l'uno.&gt;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>it-en</head><p>Translate the following sentence into English: &lt;Nella giornata di lunedì, alcuni scienziati della Scuola di Medicina dell'Università di Stanford hanno annunciato l'invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l'uno.&gt; &lt;On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.&gt;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Examples of the prompt formats proposed for the MT Challenge. Prompt en-it is designed for translation from English into Italian, prompt it-en for the opposite direction. In both cases, only a single shot, taken from the OPEN dev set, is shown for instructing Llama3-8B-Instruct, while in the experiments of Section 4 four shots are provided to the model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Statistics of the benchmark in terms of number of segments and of (detokenized) words on the English and Italian sides.</p><p>Llama-3-8B-Instruct: 10 an LLM from the Llama 3 model family <ref type="bibr" target="#b4">[5]</ref>. It is an instruction-tuned model, i.e. it is fine-tuned to align its outputs with the desired response characteristics <ref type="bibr" target="#b15">[16]</ref>, in this case for assistant-like chat. Therefore, we provide the 4-shot prompts described in Section 3.2 as input to the model in a chat format, with user-role messages containing the instruction and the input, and assistant-role messages containing the corresponding output. 11</p><p>HelsinkiMT: 12 the Language Technology Research Group at the University of Helsinki made available, under the CC-BY-4.0 license, a set of neural MT models trained with MarianNMT 13 on OPUS data, 14 including English-Italian 15 and Italian-English 16 models. mBART50: 17 a multilingual neural translation model that covers any pair from a set of 50 languages, English and Italian included <ref type="bibr" target="#b16">[17]</ref>. Built by Meta/Facebook on the fairseq toolkit, 18 it is released under the MIT license. Its network has approximately 600M parameters. NLLB: 19 No Language Left Behind (NLLB) is also a multilingual neural translation model, covering any pair from more than 200 languages, including the two we are interested in. The code was developed by Meta/Facebook as a branch of fairseq and is released under the MIT license. Five different NLLB models are available under the CC-BY-NC 4.0 license; they differ mainly in size, ranging from the smallest with 600M parameters to the largest with 54.5B parameters. On the basis of their manageability and the official performance claimed by the authors, we decided to include two NLLB models in this investigation: the distilled variant with 1.3B parameters (NLLB_1.3B) and the one with 3.3B parameters (NLLB_3.3B).</p><p>Table <ref type="table">3</ref> provides the scores measured for each model on all evaluation sets of the benchmark, except for the OPEN dev set, which we reserved as the source of the exemplars used for few-shot prompting with Llama-3-8B-Instruct. First of all, we note that the performance of the three multilingual translation models mBART50, NLLB_1.3B, and NLLB_3.3B increases strictly with their number of parameters, on all metrics (with only one microscopic exception). In general, Llama-3-8B-Instruct performs better than mBART50 and worse than NLLB_1.3B.</p><p>The behavior of HelsinkiMT is harder to frame: in some cases it is clearly the best performing model (CLOSED-IT, it→en) or at least competitive with NLLB_3.3B (CLOSED-UK, en→it; CLOSED-IT, en→it); in others it is only slightly better than mBART50 (OPEN devtest, it→en; CLOSED-US, it→en). This can probably be explained by the fact that HelsinkiMT is not a single model, but rather a collection of models, each specifically trained to cover translation between two specific languages. That is, the HelsinkiMT en→it and it→en models were trained independently, on different training data. Therefore, their performance relative to that of the other models may not be consistent across the various sections of our benchmark.</p><p>In summary, we can state that Llama-3-8B-Instruct, a general-purpose generative model only conditioned towards performing translation by four task exemplars, compares well to dedicated translation models; fine-tuning it on the translation task would likely yield even better performance. However, it should be considered that this version of Llama-3-8B-Instruct, which is also the smallest of its model family, has 8B parameters: more than twice as many as NLLB_3.3B and an order of magnitude more than mBART50. </p></div>
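The chat-format few-shot prompting used for Llama-3-8B-Instruct, with alternating user and assistant turns, can be sketched as follows. The function name and message layout are a plausible reconstruction, not the authors' exact template:

```python
def to_chat(shots, source, instruction):
    """Turn few-shot exemplars into alternating user/assistant messages,
    ending with a final user turn holding the sentence to translate."""
    messages = []
    for src, tgt in shots:
        messages.append({"role": "user", "content": f"{instruction} <{src}>"})
        messages.append({"role": "assistant", "content": f"<{tgt}>"})
    messages.append({"role": "user", "content": f"{instruction} <{source}>"})
    return messages

msgs = to_chat([("Hello.", "Ciao."), ("Thanks.", "Grazie.")],
               "Good morning.",
               "Translate the following sentence into Italian:")
```

A list of this shape is what chat-tuned models typically consume, with each exemplar's translation placed in an assistant turn so that the model continues the pattern on the final user turn.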
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Translation results of MT models and LLMs on the benchmark. The best scores for each translation direction, subset, and metric are marked in bold.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>Nowadays, LLMs are trained on huge amounts of data mostly crawled from the web. Therefore, as already pointed out in Section 3, it is hard to be sure that there is no data contamination, that is, no overlap between training and evaluation data. Data contamination makes the evaluation of LLMs unreliable, since their performance may be inflated. In our specific case, the risk that the OPEN/FLORES+ data are contaminated is not negligible; however, the results shown in Table <ref type="table">3</ref>, which are good but realistic, do not seem to indicate any contamination.</p><p>In theory, the contamination risk of the CLOSED section is lower than that of the OPEN one, since the translations of the original texts have never been released. On the other hand, the original texts are available on the web (although only for private use), so it cannot be ruled out that the models "know" them in some way. For example, the exceptionally high results of HelsinkiMT on the CLOSED-IT set seem to be an anomaly, likely due to data contamination.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>Our proposal does not focus on ethically charged topics. While the data we propose for the evaluation of automatic translation may mention sensitive topics or be afflicted by ethical issues such as social biases (e.g., gender bias), here we focus solely on MT quality evaluation and leave the investigation of ethical aspects to other resources and analyses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>The OPEN section of our benchmark is part of the FLORES+ dataset, which is licensed under the Creative Commons Attribution Share Alike 4.0 International license, 20 requiring derivatives to be distributed under the same or a similar, compatible license. We opted for the same license.</p><p>There is no license associated with the CLOSED part of our benchmark, as it is not distributed and can only be used by CALAMITA Organizers for evaluation purposes.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>8 sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0. 9 sacreBLEU signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://en.wikipedia.org/wiki/DeepL_Translator</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">https://github.com/openlanguagedata/flores</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">https://huggingface.co/datasets/FBK-MT/ MAGNETbenchmark4CALAMITA24</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The work presented in this paper is funded by the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People) and the PNRR project FAIR -Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>https://mt.fbk.eu/author/cettolo/ (M. Cettolo); https://mt.fbk.eu/author/apiergentili/ (A. Piergentili); https://mt.fbk.eu/author/spapi/ (S. Papi); https://mt.fbk.eu/author/mgaido/ (M. Gaido); https://mt.fbk.eu/author/negri/ (M. Negri); https://mt.fbk.eu/author/bentivogli/ (L. Bentivogli). 3 https://www2.statmt.org/wmt24/translation-task.html 4 https://iwslt.org/2024/#shared-tasks 5 https://www2.statmt.org/wmt24/metrics-task.html</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><surname>Openai</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<title level="m">GPT-4</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">technical report</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2302.13971" />
		<title level="m">Llama: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Llama 2: Open foundation and finetuned chat models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2307.09288" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">The Llama 3 herd of models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dubey</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2407.21783</idno>
		<ptr target="https://arxiv.org/abs/2407.21783" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4-6, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><surname>NLLB Team</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Costa-Jussà</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Çelebi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elbayad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Heafield</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Heffernan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kalbassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Licht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maillard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Youngblood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Akula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Barrault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mejia-Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hansanti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jarrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Sadagopan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spruit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Andrews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Ayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mourachko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ropers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saleem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2207.04672</idno>
		<title level="m">No language left behind: Scaling human-centered machine translation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Investigating data contamination in modern benchmarks for large language models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gerstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.naacl-long.482" />
	</analytic>
	<monogr>
		<title level="m">Proc. of NAACL</title>
		<title level="s">Long Papers</title>
		<meeting>of NAACL<address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="8706" to="8719" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent</title>
		<author>
			<persName><forename type="first">M</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mathur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Avramidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thompson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kocmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Blain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Deutsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stewart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Castilho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Foster</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.wmt-1.51" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="578" to="628" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">To ship or not to ship: An extensive evaluation of automatic metrics for machine translation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kocmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Federmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Grundkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Junczys-Dowmunt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Matsushita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Menezes</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.wmt-1.57" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT, Online</title>
				<meeting>of WMT, Online</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="478" to="494" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">BLEU: a Method for Automatic Evaluation of Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ACL</title>
				<meeting>of ACL<address><addrLine>Philadelphia, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">chrF: character n-gram F-score for automatic MT evaluation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W15-3049" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="392" to="395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A Call for Clarity in Reporting BLEU Scores</title>
		<author>
			<persName><forename type="first">M</forename><surname>Post</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W18-6319" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Belgium, Brussels</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="186" to="191" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">BLEURT: Learning robust metrics for text generation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Parikh</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.acl-main.704" />
	</analytic>
	<monogr>
		<title level="m">Proc. of ACL, Online</title>
				<meeting>of ACL, Online</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7881" to="7892" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">COMET-22: Unbabel-IST 2022 submission for the metrics shared task</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G C</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Farinha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Glushkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Coheur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.wmt-1.52" />
	</analytic>
	<monogr>
		<title level="m">Proc. of WMT</title>
				<meeting>of WMT<address><addrLine>Abu Dhabi, United Arab Emirates (Hybrid)</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="578" to="585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Instruction tuning for large language models: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.10792</idno>
		<ptr target="https://arxiv.org/abs/2308.10792" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Multilingual translation with extensible multilingual pretraining and finetuning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2008.00401</idno>
		<ptr target="https://arxiv.org/abs/2008.00401" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
