<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mauro Cettolo</string-name>
          <email>cettolo@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Piergentili</string-name>
          <email>apiergentili@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Papi</string-name>
          <email>spapi@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Gaido</string-name>
          <email>mgaido@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Negri</string-name>
          <email>negri@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luisa Bentivogli</string-name>
          <email>bentivo@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose MAGNET - MAchines GeNErating Translations, a CALAMITA Challenge which aims at testing the ability of large language models (LLMs) in the hot topic of automatic translation, focusing on Italian and English (in both directions) to overcome the marginality with which Italian is considered by the machine translation community. We propose a benchmark composed of two portions with different distribution policies (one free to use, the other not disclosable), which allows us to handle data contamination issues. The publicly available section of the benchmark is distributed on Hugging Face; in this report we describe the details of our challenge, including the prompt formats to be used. Additionally, we report the performance of five models, including an LLM and differently sized translation models, in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine translation</kwd>
        <kwd>English-Italian</kwd>
        <kwd>FLORES+</kwd>
        <kwd>Bleu</kwd>
        <kwd>ChrF</kwd>
        <kwd>Bleurt</kwd>
        <kwd>Comet</kwd>
        <kwd>Llama3-8B-Instruct</kwd>
        <kwd>mBART50</kwd>
        <kwd>NLLB</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>
        Machine Translation (MT) refers to the process, carried out
by a computer program, of translating text from one
language to another without human involvement. The idea of
using digital computers to translate natural languages dates
back to the 1940s, making MT one of the oldest fields of
artificial intelligence. Since then, translation quality has
improved steadily through increasingly effective approaches
(rule-, example-, and statistics-based); however, the most
significant advances have likely been observed over the last
few years, thanks to the introduction of neural networks.
Neural models specifically trained for the translation task,
like DeepL Translator,1 reach outstanding quality, even if
so-called human parity has not been achieved yet, especially
in unrestricted domains and for language pairs not involving
English. Recently, an alternative neural-based method has
been gathering a lot of interest due to its undoubted potential:
prompting generative large language models (LLMs), like
the GPT models [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and the Llama model family [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ], to
translate a text. Whatever the approach, the MT research
community is largely focused on the development and
validation of models covering English and a few other languages,
paying little attention to, or completely neglecting, the vast
majority of the more than 7,000 languages spoken in the
world, including Italian. On the other hand, the global MT
market size was valued at USD 847.24 million in 2021 and is
expected to expand at a compound annual growth rate of
16.4% in 2024-2031, reaching USD 2107.56 million by 2027.2
Since Europe, and with it Italy, is one of the leading regions
for the MT market, CALAMITA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] cannot miss MT. We
therefore propose the challenge of testing LLMs’ ability in
the hot topic of automatic translation, focusing on Italian and
English (in both directions) to overcome the marginality
with which Italian is considered by the MT community.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Challenge: Description</title>
      <p>The MAGNET challenge provides a framework for assessing
the ability of LLMs to translate Italian text into English and
vice versa. It is organized following the blueprint of other
long-standing MT shared tasks, such as those proposed
at the WMT3 and IWSLT4 conferences, where organizers
prepare and distribute development and test sets, define the
training conditions (possibly providing specific training data),
establish the evaluation modalities (typically via automatic
metrics, occasionally enriched by human evaluations),
collect and evaluate participants’ submissions, and finally
disclose the results.</p>
      <p>The MAGNET challenge supplies a benchmark divided into
two portions: one based on a publicly available MT
benchmark and a private one (see Section 3). This allows
participants not only to evaluate their models but also, possibly,
to fine-tune them, by exploiting the open portion of the
MAGNET benchmark for development purposes.</p>
      <p>Multiple evaluation metrics are employed so as to obtain a
comprehensive overview of the quality of the translations
generated by a specific model. Indeed, shared tasks on
automatic metrics are still being organized,5 evidence of
the fact that none of the metrics designed so far by the
scientific community has proven capable of covering, by
itself, every single aspect that defines a “good” translation.
2https://www.linkedin.com/pulse/machine-translation-mt-market-size2024-suhoe/
3https://www2.statmt.org/wmt24/translation-task.html
4https://iwslt.org/2024/#shared-tasks
5https://www2.statmt.org/wmt24/metrics-task.html</p>
      <p>In addition, in order to allow for comparisons, scores
measured on the translations generated by Llama3-8B-Instruct
and a number of other models are made available (see
Section 4).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data description</title>
      <p>We test LLMs’ ability to translate between Italian and
English using a parallel corpus composed of two parts: an
OPEN portion and a CLOSED one.</p>
      <p>
        OPEN For the OPEN portion of the MAGNET benchmark
we propose FLORES+, the latest version of FLORES-200,6 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
a multilingual MT evaluation benchmark released under CC
BY-SA 4.0 by FAIR researchers at Meta. It consists of English
sentences sampled in equal amounts from Wikinews (an
international news source), Wikijunior (a collection of
age-appropriate non-fiction books), and Wikivoyage (a travel
guide), translated into more than 200 languages, including
Italian. Dev and devtest sets of about 1,000 segments each
are provided. See Section 3.3 for statistics on this portion of
the MAGNET benchmark.
      </p>
      <p>CLOSED The CLOSED subset is an MT test set developed
by FBK by collecting English and Italian news texts and
commissioning their professional translation to a specialized
company. This resource is private and not publicly
accessible. See Section 3.3 for statistics on this portion of
the MAGNET benchmark.</p>
      <p>
        Both subsets allow for the evaluation of MT quality in
both translation directions, i.e. English→Italian and
Italian→English. The decision to split our benchmark into two
subsets is primarily motivated by their current distribution
policy, which is inherently linked to growing concerns about
data contamination [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Data contamination refers to the
possibility that the input-output pairs used in LLM tests
occur in the huge datasets typically used for pre-training
and fine-tuning; such overlap can lead to inflated
benchmark scores, creating an overly favorable impression of an
LLM’s abilities. Although it is challenging to determine with
certainty whether the models being evaluated were trained
on popular datasets scraped from the web, this possibility
should be taken seriously. To promote sound evaluation
and mitigate the effects of biased or potentially
misleading results due to data contamination, one approach is to
rely exclusively on – or at least include among the
benchmarks – “safe” datasets that are either private or have very
controlled/limited distribution. Therefore, pairing a larger,
widely used public dataset (FLORES+) with a smaller,
in-house dataset – the CLOSED subset – aims to strike a balance
between the thoroughness and the reliability of the
evaluation.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Data format</title>
        <p>The datasets are organized in a parallel text format, i.e.
every entry is composed of a sentence in one language and the
corresponding translation. The OPEN portion of the
benchmark is publicly available on Hugging Face,7 whereas access
to the CLOSED portion is only provided to the Organizers
of the task.
6https://github.com/openlanguagedata/flores
7https://huggingface.co/datasets/FBK-MT/MAGNETbenchmark4CALAMITA24</p>
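        <p>As a minimal illustration of this parallel format (a sketch of our own, not code distributed with the benchmark: the function name and entry layout are assumptions), each line of a source-language file is paired with the same-numbered line of a target-language file:</p>

```python
def load_parallel(src_lines, tgt_lines):
    """Pair each source sentence with its reference translation.

    A parallel corpus is aligned line by line: entry i of the
    source file corresponds to entry i of the target file.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("parallel files must have the same number of lines")
    return [
        {"source": s.strip(), "target": t.strip()}
        for s, t in zip(src_lines, tgt_lines)
    ]
```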
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompts</title>
        <p>Models are prompted with an explicit instruction followed by the sentence to be translated, enclosed in angle brackets, i.e. “Translate the following sentence into Italian: &lt;...&gt;” for the en-it direction and “Translate the following sentence into English: &lt;...&gt;” for the it-en direction; the expected output is the corresponding translation, also enclosed in angle brackets. For the instruction-tuned LLM, the instruction is given in a 4-shot setting, with exemplar translation pairs drawn from the OPEN dev set (see Section 4).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Detailed data statistics</title>
        <p>In Table 2, detailed statistics are provided on the various
sections of the benchmark in terms of number of segments
(#seg) and of English (|en|) and Italian (|it|) words.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Metrics</title>
      <p>
        We evaluate LLMs’ performance in translation using a set
of four automatic metrics, selected in light of the ongoing
challenges in MT evaluation, which is still an open
problem. New metrics are indeed continually proposed, and
evaluation campaigns aimed at assessing these metrics are
organised periodically (for example, the annual WMT
Metrics Shared Task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Broadly, automatic metrics can be
divided into string-based metrics and metrics based on
pretrained models, each group having its own strengths
and weaknesses [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Therefore, for a more comprehensive
translation quality evaluation accounting for their
complementarity, we adopt two metrics from each group,
selected among the most commonly used ones:
• string-based: BLEU8 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and CHRF9 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] via
sacreBLEU [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
• based on pretrained models: BLEURT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
(checkpoint: BLEURT-20) and COMET [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (model:
wmt22-comet-da).
      </p>
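      <p>To give a concrete idea of how a string-based metric works, the following is a simplified, self-contained sketch of a chrF-style character n-gram F-score (beta=2, n=1..6). This is our own illustration, not the implementation used for the scores in this report, which are computed with sacreBLEU and differ in details such as corpus-level aggregation:</p>

```python
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams of order n, ignoring whitespace."""
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: averaged char n-gram F-score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # one of the strings has no n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

      <p>A perfect match scores 1.0, and scores decrease as the hypothesis and reference share fewer character n-grams; this is the higher-is-better behavior shared by all four metrics above.</p>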
      <p>All of them are quality metrics, that is, the higher the
score, the better the translation. The overview of the scores
from all these metrics allows for a robust assessment of the
quality of individual models, as well as a fair comparison
between different models.</p>
      <p>We provide reference performance on our challenge of
one of the most popular open LLMs, and four
state-of-the-art MT models:
8sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
9sacreBLEU signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0</p>
      <p>[Prompt examples for both directions (prompt and expected content):
en-it prompt: Translate the following sentence into Italian: &lt;On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.&gt;
en-it content: &lt;Nella giornata di lunedí, alcuni scienziati della Scuola di Medicina dell’Università di Stanford hanno annunciato l’invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l’uno.&gt;
it-en prompt: Translate the following sentence into English: &lt;Nella giornata di lunedí, alcuni scienziati della Scuola di Medicina dell’Università di Stanford hanno annunciato l’invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l’uno.&gt;
it-en content: &lt;On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.&gt;]</p>
      <p>
        Llama-3-8B-Instruct:10 an LLM from the Llama 3 model
family [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is an instruction-tuned model, i.e. it is
fine-tuned to align its outputs with the desired response
characteristics [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], in this case for assistant-like chat. Therefore,
we provide the 4-shot prompts described in Section 3.2 as
input for the model in a chat format: user-role messages
contain the instruction and the input, while assistant-role
messages contain the corresponding output.11
      </p>
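      <p>The chat-format few-shot prompt described above can be sketched as follows (a minimal illustration: the function name and message layout are our own, and in practice the message list would be rendered with the tokenizer’s chat template, e.g. apply_chat_template in Hugging Face transformers):</p>

```python
def build_chat_prompt(exemplars, src_sentence, target_language):
    """Assemble a few-shot chat prompt for translation.

    Each exemplar (source, translation) pair becomes a user message
    carrying the instruction and input, followed by an assistant
    message carrying the expected output, as in a 4-shot setting.
    """
    instruction = "Translate the following sentence into {}: <{}>"
    messages = []
    for src, tgt in exemplars:
        messages.append({"role": "user",
                         "content": instruction.format(target_language, src)})
        messages.append({"role": "assistant", "content": f"<{tgt}>"})
    # The sentence to translate comes last, as a final user turn.
    messages.append({"role": "user",
                     "content": instruction.format(target_language,
                                                   src_sentence)})
    return messages
```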
      <p>HelsinkiMT:12 the Language Technology Research
Group at the University of Helsinki made available under
the CC-BY-4.0 license a set of neural MT models trained with
MarianNMT13 on OPUS data,14 including English-Italian15
and Italian-English16 models.</p>
      <p>
        mBART50:17 a multilingual neural translation model
that covers any pair from a set of 50 languages, English and
Italian included [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Built by Meta/Facebook on the fairseq
toolkit,18 it is released under the MIT license. Its network
has approximately 600M parameters.
      </p>
      <p>NLLB:19 No Language Left Behind (NLLB) is also a
multilingual neural translation model that covers any pair from
more than 200 languages, including the two we are
interested in. The code was developed by Meta/Facebook as a
branch of fairseq and is released under the MIT license. Five
different NLLB models are available under the CC-BY-NC
4.0 license, which mainly differ in size, ranging from the
smallest with 600M parameters to the largest with 54.5B
parameters. On the basis of their manageability and the official
performance claimed by the authors, we decided to include
two NLLB models in this investigation: the distilled variant
with 1.3B parameters (NLLB_1.3B) and the one with 3.3B
parameters (NLLB_3.3B).
10https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
11https://huggingface.co/docs/transformers/main/en/chat_templating
12https://github.com/Helsinki-NLP/Opus-MT
13https://marian-nmt.github.io/
14https://opus.nlpl.eu/
15https://huggingface.co/Helsinki-NLP/opus-mt-en-it
16https://huggingface.co/Helsinki-NLP/opus-mt-it-en
17https://huggingface.co/facebook/mbart-large-50
18https://github.com/facebookresearch/fairseq
19https://github.com/facebookresearch/fairseq/tree/nllb</p>
      <p>Table 3 provides the scores measured for each model on
all evaluation sets of the benchmark, except for the OPEN
dev set, since we reserved that subset as the source of the
exemplars used for few-shot prompting with
Llama-3-8B-Instruct. First of all, we note that the performances of the
three multilingual translation models mBART50, NLLB_1.3B
and NLLB_3.3B are strictly in increasing order according
to their number of parameters, with respect to all metrics
(with only one microscopic exception). In general,
Llama-3-8B-Instruct performs better than mBART50 and worse than
NLLB_1.3B.</p>
      <p>The behavior of HelsinkiMT is more difficult to frame:
there are cases in which it is definitely the best
performing model (CLOSED-IT, it→en) or at least competitive with
NLLB_3.3B (CLOSED-UK, en→it; CLOSED-IT, en→it);
others in which it is only slightly better than mBART50 (OPEN
devtst, it→en; CLOSED-US, it→en). This can probably be
explained by the fact that HelsinkiMT is not a single model,
but rather a collection of models specifically trained to
cover the translation between specific language pairs. That is,
the HelsinkiMT en→it and it→en models were trained
independently, on different training data. Therefore, their
performance relative to that of the other models may not be
consistent across the various sections of our benchmark.</p>
      <p>In summary, we can state that Llama-3-8B-Instruct, a
general-purpose generative model only conditioned towards
performing translation by four task exemplars, compares
well to translation models; likely, fine-tuning
Llama-3-8B-Instruct on the translation task could allow it to achieve
even better performance. However, it should be considered
that this version of Llama-3-8B-Instruct – which is also the
smallest of that model family – has 8B parameters, more
than twice the parameters of NLLB_3.3B and an order of
magnitude more than mBART50.</p>
      <sec id="sec-4-1">
        <title>Table 3 (recovered excerpt)</title>
        <p>Only part of Table 3 is recoverable. Systems, in order:
HelsinkiMT, mBART50, NLLB_1.3B, NLLB_3.3B,
Llama-3-8B-Instruct; for each set, the two recoverable columns appear
to be a neural-metric score and BLEU.
OPEN – devtst: 0.8656 / 27.53; 0.8494 / 23.88; 0.8774 / 29.31; 0.8805 / 29.95; 0.8795 / 26.36.
CLOSED – UK: 0.8949 / 57.35; 0.8776 / 47.46; 0.8954 / 55.12; 0.8968 / 56.00; 0.8985 / 39.29.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>Nowadays, LLMs are trained on huge amounts of data
mostly crawled from the web. Therefore, as already pointed
out in Section 3, it is hard to be sure that there is no data
contamination, that is, no overlap between training and
evaluation data. Data contamination makes the evaluation of
LLMs unreliable, since their performance may be inflated.</p>
      <p>Concerning our specific case, the risk that OPEN/FLORES+
data are contaminated is not negligible; however, the results
shown in Table 3, which are good but realistic, do not seem
to indicate any contamination.</p>
      <p>In theory, the contamination risk of the CLOSED section is
lower than for the OPEN one, since the translations of the
original texts have never been released. On the other hand,
the original texts are available on the web (although only for
private use), therefore it cannot be ruled out that the models
“know” them in some way. For example, the exceptionally
high results of HelsinkiMT on the CLOSED-IT set seem to
be an anomaly, likely due to data contamination.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Ethical issues</title>
      <p>Our proposal does not focus on ethically charged topics.
While the data we propose for the evaluation of automatic
translation may mention sensitive topics or be afflicted by
ethical issues such as social biases (e.g., gender bias), here we
focus solely on MT quality evaluation and leave the
investigation of ethical aspects to other resources and analyses.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Data license and copyright issues</title>
      <p>The OPEN section of our benchmark is part of the
FLORES+ dataset, which is licensed under the Creative Commons
Attribution Share Alike 4.0 International,20 a license that requires
derivatives to be distributed under the same or a similar,
compatible license. We opted for the same license.</p>
      <p>There is no license associated with the CLOSED part of
our benchmark as it is not distributed and can only be used
by CALAMITA Organizers for evaluation purposes.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The work presented in this paper is funded by the
European Union’s Horizon research and innovation programme
under grant agreement No 101135798, project Meetween
(My Personal AI Mediator for Virtual MEETtings BetWEEN
People) and the PNRR project FAIR - Future AI Research
(PE00000013), under the NRRP MUR program funded by the
NextGenerationEU.
20https://github.com/openlanguagedata/flores/blob/main/LICENSE</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https: //proceedings.neurips.cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI, GPT-4
          <source>technical report</source>
          ,
          <year>2024</year>
          . URL: https:// arxiv.org/abs/2303.08774. arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.13971. arXiv:
          <volume>2302</volume>
          .
          <fpage>13971</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          :
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/ 2307.09288. arXiv:
          <volume>2307</volume>
          .
          <fpage>09288</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:
          <volume>2407</volume>
          .
          <fpage>21783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Musacchio</surname></string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</source>
          , Pisa, Italy, December 4-6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>NLLB Team</surname></string-name>
          ,
          <string-name><given-names>M. R.</given-names> <surname>Costa-jussà</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Cross</surname></string-name>
          ,
          <string-name><given-names>O.</given-names> <surname>Çelebi</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Elbayad</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Heafield</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Heffernan</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Kalbassi</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Lam</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Licht</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Maillard</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Sun</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Wenzek</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Youngblood</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Akula</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Barrault</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Mejia Gonzalez</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Hansanti</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Hoffman</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Jarrett</surname></string-name>
          ,
          <string-name><given-names>K. R.</given-names> <surname>Sadagopan</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Rowe</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Spruit</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Tran</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Andrews</surname></string-name>
          ,
          <string-name><given-names>N. F.</given-names> <surname>Ayan</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Bhosale</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Edunov</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Fan</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Gao</surname></string-name>
          ,
          <string-name><given-names>V.</given-names> <surname>Goswami</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Guzmán</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Koehn</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Mourachko</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Ropers</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Saleem</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Schwenk</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>
          ,
          <article-title>No language left behind: Scaling human-centered machine translation</article-title>
          ,
          <year>2022</year>
          . arXiv:2207.04672.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Investigating data contamination in modern benchmarks for large language models</article-title>
          ,
          <source>in: Proc. of NAACL (Volume 1: Long Papers)</source>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>8706</fpage>
          -
          <lpage>8719</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.482.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name><given-names>C.-k.</given-names> <surname>Lo</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Avramidis</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Thompson</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Kocmi</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Blain</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Deutsch</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Stewart</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Zerva</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Castilho</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Foster</surname></string-name>
          ,
          <article-title>Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent</article-title>
          ,
          <source>in: Proc. of WMT</source>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>578</fpage>
          -
          <lpage>628</lpage>
          . URL: https://aclanthology.org/2023.wmt-1.51.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kocmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grundkiewicz</surname>
          </string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Junczys-Dowmunt</surname></string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Matsushita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menezes</surname>
          </string-name>
          ,
          <article-title>To ship or not to ship: An extensive evaluation of automatic metrics for machine translation</article-title>
          ,
          <source>in: Proc. of WMT</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>494</lpage>
          . URL: https://aclanthology.org/2021.wmt-1.57.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name><given-names>W.-J.</given-names> <surname>Zhu</surname></string-name>
          ,
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          ,
          <source>in: Proc. of ACL</source>
          , Philadelphia, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Popović</surname>
          </string-name>
          ,
          <article-title>chrF: character n-gram F-score for automatic MT evaluation</article-title>
          ,
          <source>in: Proc. of WMT</source>
          , Lisbon, Portugal,
          <year>2015</year>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>395</lpage>
          . URL: https://aclanthology.org/W15-3049.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Post</surname>
          </string-name>
          ,
          <article-title>A Call for Clarity in Reporting BLEU Scores</article-title>
          ,
          <source>in: Proc. of WMT</source>
          , Belgium, Brussels,
          <year>2018</year>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>191</lpage>
          . URL: https://www.aclweb.org/anthology/W18-6319.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Das</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Parikh</surname></string-name>
          ,
          <article-title>BLEURT: Learning robust metrics for text generation</article-title>
          ,
          <source>in: Proc. of ACL</source>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>7881</fpage>
          -
          <lpage>7892</lpage>
          . URL: https://aclanthology.org/2020.acl-main.704.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name><given-names>J. G. C.</given-names> <surname>de Souza</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Alves</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Zerva</surname></string-name>
          ,
          <string-name><given-names>A. C.</given-names> <surname>Farinha</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Glushkova</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Coheur</surname></string-name>
          ,
          <string-name><given-names>A. F. T.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>COMET-22: Unbabel-IST 2022 submission for the metrics shared task</article-title>
          ,
          <source>in: Proc. of WMT</source>
          , Abu Dhabi, United Arab Emirates (Hybrid)
          ,
          <year>2022</year>
          , pp.
          <fpage>578</fpage>
          -
          <lpage>585</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.52.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Wang</surname></string-name>
          ,
          <article-title>Instruction tuning for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2308.10792. arXiv:2308.10792.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>Multilingual translation with extensible multilingual pretraining and finetuning</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2008.00401. arXiv:2008.00401.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>