<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Orlando</string-name>
          <email>orlando@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <email>moroni@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pere-Lluís Huguet Cabot</string-name>
          <email>huguetcabot@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Barba</string-name>
          <email>barba@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Conia</string-name>
          <email>conia@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Orlandini</string-name>
          <email>s.orlandini@cineca.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Fiameni</string-name>
          <email>gfiameni@nvidia.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <email>navigli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CINECA</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Large Language Models</institution>
          ,
          <addr-line>Language Modeling, Italian Language, LLM Pretraining</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>NVIDIA</institution>
          ,
          <addr-line>Santa Clara, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Sapienza NLP Group, Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model's vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva's development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Language Modeling</kwd>
        <kwd>Italian Language</kwd>
        <kwd>LLM Pretraining</kwd>
      </kwd-group>
      <conference>
        <conf-name>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR</p>
      <p>ceur-ws.org
existing English-centric LLMs to other languages, and</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>Large Language Models (LLMs) have revolutionized the</title>
        <p>way Natural Language Processing (NLP) tasks are
approached, achieving remarkable results in existing areas
and opening the door to entirely new research directions
and applications. As a result, the energy and resources
ing Italian text using multilingual or language-adapted</p>
      </sec>
      <sec id="sec-2-2">
        <title>English models, e.g., from Mistral [1] or Llama [2, 3], is</title>
        <p>computationally more expensive and often less efective
compared to using a model specifically designed for the</p>
      </sec>
      <sec id="sec-2-3">
        <title>Italian language. This ineficiency stems from the vocab</title>
        <p>units, or tokens, that the model can use to compose text
– when it is not optimized for the Italian language,
resulting in Italian words being split into an excessive number
of tokens. Consequently, this creates longer sequences
of tokens, slower generation times, and higher
computational costs, especially since many popular attention
mechanisms have a quadratic complexity with respect to</p>
        <p>Eforts to create language-specific LLMs are increasing,
ing existing English-centric LLMs to other languages
are enticing: starting with a proven model can reduce
the computational requirements, and adaptation can be
achieved with relatively modest amounts of data. There
are several language adaptation techniques, which range
guage [4, 5] to modifying the model’s architecture [6, 7, 8],
making these techniques flexible for diferent budgets
and objectives. However, these techniques may not fully
capture language-specific nuances and can degrade the
performance in the original language, indeed an
undesirable efect. Alternatively, training LLMs from scratch
provides the freedom to make design choices tailored
to the linguistic features of the target
language—including morphology, lexicon, syntax, and semantics—which
also allows for incorporating culturally relevant
content, reducing biases that might be present in models
primarily trained on English data, thus leading to more
inclusive and accurate representations of language use.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Unfortunately, while there are several eforts on adapt</title>
        <p>ing English-centric LLMs to the Italian language, e.g.,</p>
      </sec>
      <sec id="sec-2-5">
        <title>Llamantino-2 [4], Llamantino-3 [5], DanteLLM [10], and</title>
      </sec>
      <sec id="sec-2-6">
        <title>Camoscio [11], inter alia, there is no truly open-source</title>
        <p>endeavor exploring what can be achieved by training an</p>
      </sec>
      <sec id="sec-2-7">
        <title>LLM from scratch on Italian data.</title>
      </sec>
      <sec id="sec-2-8">
        <title>With this work, we follow the latter path and introduce</title>
      </sec>
      <sec id="sec-2-9">
        <title>Minerva, the first family of LLMs designed specifically</title>
        <p>for the Italian language and pretrained on Italian text.1</p>
      </sec>
      <sec id="sec-2-10">
        <title>We present the design choices for our models, our data</title>
        <p>processing, and the evaluation results regarding our
Minerva LLMs, showing that our models – with 350M, 1B,</p>
      </sec>
      <sec id="sec-2-11">
        <title>3B, and 7B parameters – outperform comparable multi</title>
        <p>lingual models and even rival larger models adapted for</p>
      </sec>
      <sec id="sec-2-12">
        <title>Italian. We conclude with a discussion on the benefits</title>
        <p>and challenges of pretraining LLMs from scratch for the</p>
      </sec>
      <sec id="sec-2-13">
        <title>Italian language, sharing our experience and findings to</title>
        <p>provide valuable insights for the academic and industrial
communities interested in training non-English LLMs
from scratch. Lastly, we describe the technical details of</p>
      </sec>
      <sec id="sec-2-14">
        <title>Minerva-7B, our latest model with 7.4 billion parameters, for which we share our initial results.</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Building a Pretraining Dataset for Italian LLMs</title>
      <p>The field of LLMs is growing at an astonishing pace, with
new models, datasets, benchmarks, and techniques
presented every week. However, over the past few months,
academic and industrial researchers have increasingly
recognized the fundamental role of the data used to
pretrain LLMs. Unsurprisingly, the majority of the leading
companies are not releasing their training data as they
seek to maintain an advantage over the competition, with
very few exceptions (e.g., OLMo by AllenAI [12] and
OpenELM by Apple [13]). In this section, we describe the
different sources of data used in the training of the
Minerva models, and Table 1 provides an overview of these
(cf. Appendix A for more details). Most importantly, the
training datasets we used are entirely available online,
making our process transparent and allowing researchers
to better study the connection between pretraining data
and model behavior.</p>
      <sec id="sec-3-1">
        <title>2.1. Data Sources</title>
        <p>The training data for our Minerva models consists of three main categories: Italian, English, and code data.</p>
        <p>[Table 1: Overview of the data used to train the Minerva models. Italian: RedPajama-V2, CulturaX, Wikipedia, Gutenberg, Wikisource, EurLex, Gazzetta Ufficiale. English: FineWeb, CulturaX, Wikipedia, ArXiv, Gutenberg, StackExchange. Code: The Stack V2. The number of tokens drawn from each source is reported in the original table.]</p>
        <sec id="sec-3-1-1">
          <title>2.1.1. Italian Data</title>
          <p>Web data. The majority of the text used to train LLMs is sourced from Web-scraped data, typically from CommonCrawl (CC). Therefore, a significant portion of the Italian text included in our training datasets is also of this nature, inherently exposing our models to the potential biases and toxic content commonly found on the Web. Because preprocessing techniques, such as language identification, perplexity filtering, deduplication, and content classification, are computationally expensive, the most sensible choice is to rely on preprocessed collections, such as CulturaX [14] and RedPajama v2 [15]. These collections already include Italian data, and have undergone various levels of filtering and deduplication, as discussed in Section 2.3.</p>
          <p>Curated data. While Penedo et al. [16] suggest that high-quality Web data is sufficient on its own to train LLMs, curated data sources are often used to further improve model performance and introduce a broader diversity of data types, such as encyclopedic and academic text [17], as well as scientific and math-related text. Therefore, we include curated texts from several sources, including Wikipedia (encyclopedic/world knowledge data), EurLex and Gazzetta Ufficiale (law, economics, and politics), and the Gutenberg Project (novels, poetry, etc.).</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>2.1.2. English Data</title>
          <p>Web data. Mirroring our approach with the Italian data, we use preprocessed collections of English data from the Web. Given that English is the most popular language on the Internet and has been the primary focus of LLM research, there are numerous options that already provide a large amount of tokens from filtered, deduplicated, and cleaned sources. For our Minerva-350M, 1B, and 3B models, we collect data from the English partition of CulturaX, capping the number of tokens to the same amount as the Italian ones, as shown in Table 1. Instead, to train Minerva-7B, we use a portion of FineWeb [18], which includes filtered and deduplicated CC dumps with various timestamps. Specifically, we use the CC dumps from 2023-14 to 2024-18 to match the total number of tokens in the Italian Web partition of our training data.</p>
          <p>Curated sources. We include the 5.3B tokens from the English Wikipedia and 7B tokens from the copyright-free books in Project Gutenberg. Additionally, we include data from arXiv and StackExchange, as provided in the RedPajama dataset.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>2.1.3. Code Data</title>
          <p>Previous work has highlighted the importance of including source code in the pretraining corpus of an LLM, in order to improve not only its code understanding and generation, but also its general reasoning capabilities [19], even for tasks that do not directly involve or require programming. Therefore, for our largest model – Minerva-7B – we also include a portion of code data. More specifically, we extract 200B tokens from The Stack V2 [20], selecting the data from their deduplicated partition, which includes 17 of the most popular programming languages on GitHub.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Data Preprocessing</title>
        <sec id="sec-3-2-1">
          <title>As mentioned above, our preprocessing efort remains</title>
          <p>minimal, as we rely on the preprocessing pipelines used
in CulturaX, RedPajama, and FineWeb. To evaluate the
content and quality of our training data, we employ the
methodology described in Elazar et al. [21] to analyze the</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>URL domain distribution within the Italian partition of</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>CulturaX and RedPajama, as these partitions had never been utilized in training an LLM prior to Minerva. We provide an overview of our analysis together with a few insights in Appendix B.</title>
        </sec>
        <sec id="sec-3-2-4">
          <title>The vocabulary of an LLM is mainly impacted by its size,</title>
          <p>i.e., the number of tokens in the vocabulary itself, and
how the tokenizer is trained, i.e., which tokens make up
the vocabulary. These two factors impact the fertility
of the resulting tokenizer, which measures the average
number of tokens (subwords) into which a word is split.</p>
        </sec>
        <sec id="sec-3-2-5">
          <title>Tokenizers with lower fertility are preferable, as the input</title>
          <p>and output sequences they produce are shorter,
result2.3. Data Filtering and Deduplication ing in an eficiency gain, especially as most attention
Previous work on English-centric LLMs [22] has already mechanisms are quadratic with respect to the sequence
emphasized the importance of training LLMs on “clean” length. Unsurprisingly, the vocabulary allocation of an
data. Two of the most important parts of data cleaning English-centric LLM minimizes the fertility of English
are filtering, i.e., removing content that does not satisfy a text, and results in high fertility values for Italian text, as
set of criteria, and deduplication, i.e., removing portions shown in Table 2.</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>In this section, we provide an overview of the Minerva</title>
        </sec>
        <sec id="sec-3-2-7">
          <title>LLMs: we describe their tokenizers, the design choices behind the model architecture, and how we trained the resulting LLMs.</title>
        </sec>
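        <p>As an illustration of this step, the sketch below applies MinHash-based Locality Sensitive Hashing with a Jaccard threshold of 0.7 within a single CC dump. It uses the datasketch library and recomputes MinHash signatures from word shingles, whereas our actual pipeline reuses the hashes already provided in the RedPajama v2 metadata:</p>
        <preformat>
# Sketch: fuzzy deduplication within one CC dump via MinHash + LSH (Jaccard 0.7).
# Assumption: MinHash signatures are recomputed from word 5-grams; the real
# pipeline reuses the precomputed document hashes shipped with RedPajama v2.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(len(words) - 4, 1)):  # word 5-gram shingles
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def dedup_dump(documents: dict) -> list:
    """Return the ids of documents kept after fuzzy deduplication."""
    lsh = MinHashLSH(threshold=0.7, num_perm=128)  # Jaccard similarity 0.7
    kept = []
    for doc_id, text in documents.items():
        m = minhash_of(text)
        if lsh.query(m):  # a near-duplicate has already been kept
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
        </preformat>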
      </sec>
      <sec id="sec-3-3">
        <title>3.1. Vocabulary and Tokenizers</title>
        <p>Tokenizer
Mistral-7B
Gemma-7B
Minerva-350M
Minerva-1B
Minerva-3B
Minerva-7B
|Vocab|
the more recent model releases by Mistral, Minerva-7B
does not use SWA. Instead, it implements full attention
across its entire context length, which can extend up to
4096 tokens, i.e., double the number of tokens for the</p>
        <sec id="sec-3-3-1">
          <title>SWA used in Minerva-350M/1B/3B. The parameters for</title>
          <p>each model size are detailed in Table 3, for which we
provide a more in-depth description in Appendix D.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Building Minerva on top of Mistral’s model architecture also brings other benefits, such as broad compatibility with the ecosystem of libraries, frameworks, and</title>
        </sec>
      </sec>
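        <p>Concretely, fertility can be estimated by tokenizing a sample of text and dividing the number of tokens by the number of whitespace-separated words. A minimal sketch, where the model identifiers are only illustrative:</p>
        <preformat>
# Sketch: estimating tokenizer fertility (average tokens per word) on a corpus.
# The model identifiers below are illustrative; any Hugging Face tokenizer works.
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, texts: list) -> float:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = ["La Divina Commedia è un poema di Dante Alighieri."]
print(fertility("mistralai/Mistral-7B-v0.1", texts))          # English-centric vocabulary
print(fertility("sapienzanlp/Minerva-3B-base-v1.0", texts))   # balanced It/En vocabulary
        </preformat>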
      <sec id="sec-3-4">
        <title>3.3. Model Training</title>
        <sec id="sec-3-4-1">
          <title>We train all the Minerva LLMs using MosaicML’s LLM</title>
          <p>Given the importance for our Minerva LLMs of hav- Foundry.2 The training process is conducted on the
ing a low fertility on Italian text, we intentionally train Leonardo Supercomputer3 hosted and maintained by
the Minerva tokenizer on a balanced mix of English and CINECA. Each node in Leonardo is equipped with 4 ×
Italian data (and code data for the 7B model). Our anal- custom NVIDIA A100 SXM4 with 64GB of VRAM.
ysis shows that this strategy leads to a much improved All our models are trained using the AdamW
optifertility on Italian data, while at the same time maintain- mizer [29] with  1 = 0.9,  2 = 0.95,  = 10 −8 (with the
ing similar fertility on English data. More specifically, only exception being Minerva-7B, which is trained
usfor Minerva-350M/1B/3B, we opted for a vocabulary size ing  = 10 −5) on a standard causal language modeling
similar to that of Mistral-7B (around 32k tokens): in this training objective. To smooth the training process, we
case, the fertility of the Minerva tokenizer is ~20% better follow standard practice in the literature and employ a
than the Mistral tokenizer on the Italian Wikipedia and warmup-then-cooldown learning rate scheduling. More
only ~1% worse on the English Wikipedia. Following specifically, we first increase the learning rate linearly
recent trends in LLMs, for Minerva-7B, we increased the during the initial training phase (2% of the total
numvocabulary size to around 50k tokens, which resulted in ber of training steps for Minerva-350M/1B/3B and 0̃.3%
a further fertility improvement of ~6% and ~5% on the for Minerva-7B) until the peak learning rate is reached
Italian and English Wikipedias, respectively, notwith- (2×10−4 for Minerva-350M/1B/3B, 3×10−4 for
Minervastanding the addition of code data to the training data. 7B), and then decrease the learning rate with a cosine
We provide more details on the tokenizer in Appendix C. scheduling until the end of the training process. The
hyperparameters used for each model are shown in Table 7.</p>
        </sec>
      </sec>
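        <p>For reference, this warmup-then-cosine schedule can be written as a function of the training step. A sketch with the Minerva-350M/1B/3B settings (2% linear warmup to a peak of 2×10⁻⁴); the final learning rate fraction is an assumption for illustration:</p>
        <preformat>
# Sketch: linear-warmup + cosine-decay learning rate schedule, with the peak LR
# and warmup fraction used for Minerva-350M/1B/3B; the final LR fraction is an
# assumption, not a value reported in the paper.
import math

def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 2e-4, warmup_frac: float = 0.02,
                  final_frac: float = 0.1) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:  # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
        </preformat>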
      <sec id="sec-3-5">
        <title>3.2. Model Architecture</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>We measure the 0-shot performance of our Minerva LLMs on ITA-Bench [30], a suite of benchmarks that have been created either by translating existing benchmarks from other languages, or by adapting existing Italian benchmarks so that they can be used for LLM evaluation. ITA-Bench includes a set of 10 benchmarks commonly used to evaluate LLMs, namely, ARC Challenge (ARC-C), ARC Easy (ARC-E) [31], BoolQ [32], GSM8K [33], HellaSwag (HS) [34], MMLU [35], PIQA [36], SciQ [37], TruthfulQA [38], and Winogrande (WG) [39]. Overall, these benchmarks offer a comprehensive view of the capabilities of an LLM on a wide variety of aspects, including scientific knowledge, world knowledge (e.g., geography, politics, economics), commonsense knowledge, physical interactions, coreference, and math reasoning, among others. Employing automatically-translated benchmarks is far from ideal, but it allows us to better compare the scores obtained in Italian with those obtained in English while the Italian research community develops Italian-specific benchmarks [40].</p>
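      <p>As an illustration of the 0-shot protocol, a common way to score a multiple-choice benchmark with a base LLM is to compare the log-likelihood the model assigns to each candidate answer. A sketch of this approach, not the exact ITA-Bench implementation:</p>
      <preformat>
# Sketch: 0-shot multiple-choice scoring by answer log-likelihood.
# This illustrates the general protocol, not the exact ITA-Bench implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapienzanlp/Minerva-3B-base-v1.0"  # illustrative identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

@torch.no_grad()
def answer_logprob(question: str, answer: str) -> float:
    # Assumes the prompt tokenization is a prefix of the full tokenization.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    logits = model(full_ids).logits.log_softmax(dim=-1)
    score = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        score += logits[0, pos - 1, token_id].item()  # logit at pos-1 predicts pos
    return score

def predict(question: str, options: list) -> str:
    return max(options, key=lambda option: answer_logprob(question, option))
      </preformat>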
      <p>As shown in Table 4, the average performance of the Minerva models increases steadily with the model size. For our 3B model, we also provide a comparison with two models of the same size: XGLM [41], a multilingual LLM by META, and OpenELM [42], a very recent English-only model developed by Apple. Our evaluation shows that Minerva-3B outperforms XGLM and OpenELM by a significant margin, i.e., +4.4% and +3.7% on average.</p>
      <p>Finally, Minerva-7B achieves the highest performance among the Minerva LLMs family, as expected. Notably, Minerva-7B achieves a higher average score than Llamantino-2. This is an interesting comparison because the pretraining data for Llama-2, i.e., the pretrained LLM used to build Llamantino-2, is not available and has never been disclosed, making the model open-weights but not entirely open-source (we stress that, for Llamantino-2, only the data used for the language adaptation process is available, whereas the pretraining data is not). When compared to closed-source LLMs such as Mistral-7B-v0.1 or Llama-3.1-8B, Minerva still lags behind in some tasks, such as BoolQ or GSM8K, which may require better reasoning capabilities and/or more pretraining data. As we can observe from Figure 1, which tracks the progress of Minerva-7B on ITA-Bench every 10,000 training steps, the model is still slowly improving towards the end of the pretraining phase, suggesting that a larger training corpus or multiple epochs may be beneficial in future developments.</p>
      <p>[Figure 1: Progress of Minerva-7B-base-v1.0 over time: average accuracy on ITA-Bench.]</p>
    </sec>
    <sec id="sec-5-downstream">
      <title>5. Downstream tasks</title>
      <p>In this section, we show the results of the Minerva models when adapted to two downstream applications. This analysis is particularly relevant for Minerva-350M and Minerva-1B, which can be utilized for specific tasks rather than as general-purpose models, offering lower computational costs. The tasks in this analysis include: i) Italian Abstractive News Summarization, and ii) Machine Translation, in both directions (IT-EN and EN-IT).</p>
      <p>News Summarization. Following Sarti and Nissim [43], we fine-tune Minerva models (up to 3B) on a concatenation of two Italian news summarization datasets: Fanpage.it and Il Post newspapers [44]. A detailed overview of the hyperparameters used to train our models is provided in Appendix E. We find that Minerva-3B obtains the best results (0.30 vs 0.29 of the second best in terms of Rouge-L); however, it is not as parameter-efficient as IT5-Large, probably because encoder-decoder models are more suitable for fine-tuning than decoder-only models [45]. In Table 8, we report the full results of Minerva fine-tuned on the aforementioned datasets and compared to the baselines in Sarti and Nissim [43], which include mBART, mT5, and IT5.</p>
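      <p>For reference, Rouge scores of the kind reported in Table 8 can be computed with the Hugging Face evaluate package. A minimal sketch with made-up example strings, not our exact evaluation script:</p>
      <preformat>
# Sketch: computing Rouge-L for generated summaries with the `evaluate` package.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Il governo approva la nuova legge di bilancio."]          # model outputs
references = ["Approvata dal governo la legge di bilancio per il 2024."]  # gold summaries
scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])  # on this scale, Minerva-3B reaches 0.30 (see Table 8)
      </preformat>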
      <sec id="sec-4-4">
        <title>Machine Translation. We also evaluate our Minerva</title>
        <p>LLMs in few-shot [46] machine translation on two
benchmarks, FLORES [47] and OPUS-100 [48]. We explore
how LLMs perform this task relying only on
in-contextlearning few-shot examples, reporting our results with</p>
      </sec>
      <sec id="sec-4-5">
        <title>5-shot prompting. We rely on the vLLM library [28] and change the default parameters with temperature=0 and max_tokens=512.</title>
      </sec>
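      <p>A sketch of this setup with vLLM; the prompt template and model identifier are illustrative, while temperature=0 and max_tokens=512 match the settings described above:</p>
      <preformat>
# Sketch: 5-shot EN-IT translation with vLLM; the prompt template is illustrative,
# while temperature=0 and max_tokens=512 match the settings described above.
from vllm import LLM, SamplingParams

llm = LLM(model="sapienzanlp/Minerva-3B-base-v1.0")  # illustrative identifier
params = SamplingParams(temperature=0, max_tokens=512)

few_shot = [  # 5 randomly sampled EN-IT pairs in our setup (2 shown here)
    ("The weather is nice today.", "Oggi il tempo è bello."),
    ("Where is the train station?", "Dov'è la stazione dei treni?"),
]
prompt = "".join(f"English: {en}\nItalian: {it}\n\n" for en, it in few_shot)
prompt += "English: How much does this book cost?\nItalian:"

output = llm.generate([prompt], params)[0]
print(output.outputs[0].text.strip())
      </preformat>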
      <sec id="sec-4-6">
        <title>We highlight that Minerva-3B reaches competitive re</title>
        <p>sults in MT in both EN-IT (84.8 on Flores and 76.7 on</p>
      </sec>
      <sec id="sec-4-7">
        <title>Opus in terms of COMET score) and IT-EN (85.7 and 78.0).</title>
        <p>Compared with other models of similar size, Minerva-3B
shows strong results when the target language is Italian
(+1.7 and +2.7 compared to Gemma-2B and Qwen-1.5B
on Opus). Minerva-7B further showcases this by
achieving the highest performance among models tested when
translating from English into Italian. The full results are
reported in Table 5.
minerva) showcase promising results on a variety of
Italian benchmarks and downstream tasks, including news
summarization and machine translation. Most
importantly, we describe, for the first time, the process of
cre6. Conclusion and Future Work ating an Italian pretraining corpus with more than 1T
tokens, and we share findings and insights into the
preIn this paper, we demonstrated the feasibility and bene- training process of Italian LLMs with the academic and
ifts of pretraining Italian language models from scratch, industrial communities, paving the way for future
rewhich not only improves the computational eficiency search in training non-English language models. We
and performance of an LLM for a target language but re- hope that our contributions will represent a stepping
duce linguistic biases inherited from English training cor- stone for future work on language-specific and
multilinpora [49]. The Minerva models (https://nlp.uniroma1.it/ gual large-scale language modeling.
Model
Minerva-1B
Minerva-3B
Minerva-7B
Gemma-2B
Qwen-1.5B
TinyLlama-1.1B-v1.1
LLaMa-2-7B
Mistral-7B
Qwen-7B</p>
        <p>FLORES</p>
        <p>OPUS
EN-IT ↑</p>
        <p>IT-EN ↑</p>
        <p>EN-IT ↑ IT-EN ↑
66.37
84.83
87.02
83.31
80.18
73.40
85.24
86.56
86.00
73.72
85.67
87.20
86.51
86.16
83.62
87.47
87.75
87.66
57.40
76.74
79.07
75.05
74.01
65.72
77.30
78.08
78.50
64.61
78.04
79.91
78.94
78.95
75.44
80.36
80.56
81.21</p>
      </sec>
      <sec id="sec-4-8">
        <title>Edoardo Barba, Simone Conia and Pere-Lluís Huguet</title>
      </sec>
      <sec id="sec-4-9">
        <title>Cabot are fully funded by the PNRR MUR project</title>
      </sec>
      <sec id="sec-4-10">
        <title>PE0000013-FAIR. Roberto Navigli acknowledges the sup</title>
        <p>port of the CREATIVE PRIN project. The authors
acknowledge the CINECA award IsB28_medit under the</p>
      </sec>
      <sec id="sec-4-11">
        <title>ISCRA initiative for the availability of high-performance</title>
        <p>computing resources and support.
//github.com/togethercomputer/RedPajama-Data. Smith, J. Dodge, What’s in my big data?, in: The
[16] G. Penedo, Q. Malartic, D. Hesslow, R. Cojo- Twelfth International Conference on Learning
Repcaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Al- resentations, 2024. URL: https://openreview.net/
mazrouei, J. Launay, The RefinedWeb dataset forum?id=RvfPnOkPV4.
for Falcon LLM: outperforming curated corpora [22] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru,
with web data, and web data only, arXiv preprint A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei,
arXiv:2306.01116 (2023). URL: https://arxiv.org/abs/ J. Launay, The refinedweb dataset for falcon llm:
2306.01116. arXiv:2306.01116. Outperforming curated corpora with web data, and
[17] M. Faysse, P. Fernandes, N. M. Guerreiro, A. Loison, web data only, 2023. URL: https://arxiv.org/abs/
D. M. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, 2306.01116. arXiv:2306.01116.</p>
        <p>P. H. Martins, A. B. Casademunt, F. Yvon, A. F. T. [23] J. Ainslie, J. Lee-Thorp, M. de Jong, Y.
ZemlyanMartins, G. Viaud, C. Hudelot, P. Colombo, Crois- skiy, F. Lebrón, S. Sanghai, Gqa: Training
generalsantllm: A truly bilingual french-english language ized multi-query transformer models from
multimodel, 2024. URL: https://arxiv.org/abs/2402.00786. head checkpoints, arXiv preprint arXiv:2305.13245
arXiv:2402.00786. (2023).
[18] G. Penedo, H. Kydlíček, L. B. allal, A. Lozhkov, [24] R. Child, S. Gray, A. Radford, I. Sutskever,
GeneratM. Mitchell, C. Rafel, L. V. Werra, T. Wolf, The ing long sequences with sparse transformers, arXiv
ifneweb datasets: Decanting the web for the finest preprint arXiv:1904.10509 (2019).
text data at scale, 2024. URL: https://arxiv.org/abs/ [25] I. Beltagy, M. E. Peters, A. Cohan, Longformer:
2406.17557. arXiv:2406.17557. The long-document transformer, arXiv preprint
[19] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, arXiv:2004.05150 (2020).</p>
        <p>
          M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Ku- [26] G. Gerganov, llama.cpp: Inference of meta’s llama
mar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. model (and others) in pure c/c++, ???? URL: https:
Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, //github.com/ggerganov/llama.cpp.
D. A. Hudson, E. Zelikman, E. Durmus, F. Lad- [27] T. Dao, FlashAttention-2: Faster attention with
hak, F. Rong, H. Ren, H. Yao, J. WANG, K. San- better parallelism and work partitioning, in:
Interthanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suz- national Conference on Learning Representations
gun, N. Kim, N. Guha, N. S. Chatterji, O. Khat- (ICLR), 2024.
tab, P. Henderson, Q. Huang, R. A. Chi, S. M. Xie, [28] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H.
S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Eficient
memT. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, ory management for large language model serving
Y. Zhang, Y. Koreeda, Holistic evaluation of lan- with pagedattention, in: Proceedings of the ACM
guage models, Transactions on Machine Learn- SIGOPS 29th Symposium on Operating Systems
ing Research (2023). URL: https://openre
          <xref ref-type="bibr" rid="ref16">view.net/ Principles, 2023</xref>
          .
forum?id=iO4LZibEqW, featured Certification, Ex- [29] I. Loshchilov, F. Hutter, Decoupled weight decay
pert Certification. regularization, arXiv preprint arXiv:1711.05101
[20] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy- (2017).
        </p>
        <p>Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, [30] L. Moroni, S. Conia, F. Martelli, R. Navigli,
ITAT. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Bench: Towards a more comprehensive evaluation
Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. for Italian LLMs, in: CLiC-it, 2024.
Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozh- [31] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A.
Sabskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, harwal, C. Schoenick, O. Tafjord, Think you have
X. He, M. Dey, E. Abati, Y. Chai, N. Muennighof, solved question answering? try arc, the ai2
reaX. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, soning challenge, arXiv preprint arXiv:1803.05457
M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, (2018).</p>
        <p>O. Dehaene, N. Patry, C. Xu, J. McAuley, H. Hu, [32] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski,
T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, M. Collins, K. Toutanova, Boolq: Exploring the
N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jer- surprising dificulty of natural yes/no questions,
nite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, arXiv preprint arXiv:1905.10044 (2019).
A. Guha, L. von Werra, H. de Vries, Starcoder [33] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen,
2 and the stack v2: The next generation, 2024. H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton,
arXiv:2402.19173. R. Nakano, et al., Training verifiers to solve math
[21] Y. Elazar, A. Bhagia, I. H. Magnusson, A. Ravichan- word problems, arXiv preprint arXiv:2110.14168
der, D. Schwenk, A. Suhr, E. P. Walsh, D. Groen- (2021).
eveld, L. Soldaini, S. Singh, H. Hajishirzi, N. A. [34] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi,
RedPajama-Data-V2
CulturaX
Wikipedia
Gutenberg
Wikisource
EurLex
Gazzetta Uficiale
FineWeb
CulturaX
Wikipedia
ArXiv
Gutenberg
StackExchange
The Stack V2</p>
        <p>Tokens</p>
        <p>Language
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2</p>
        <p>https://huggingface.co/datasets/uonlp/CulturaX
https://huggingface.co/datasets/wikimedia/wikipedia
https://huggingface.co/datasets/manu/project_gutenberg
https://huggingface.co/datasets/wikimedia/wikisource
https://huggingface.co/datasets/joelito/eurlex_resources
https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale
https://huggingface.co/datasets/HuggingFaceFW/fineweb</p>
        <p>https://huggingface.co/datasets/uonlp/CulturaX
https://huggingface.co/datasets/wikimedia/wikipedia
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T</p>
        <p>https://huggingface.co/datasets/manu/project_gutenberg
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>D. Model</title>
    </sec>
    <sec id="sec-6">
      <title>B. Dataset Insights</title>
      <p>to 7 billion parameters of the largest model,
Minerva</p>
      <sec id="sec-6-1">
        <title>7B. The Minerva family also includes Minerva-1B and</title>
        <p>We leveraged the WIMBD5 library to compute word Minerva-3B, with 1 billion and 3 billion parameters,
recounts per URL domain on CulturaX. We decided not spectively. More specifically, the Minerva-7B model is
to do this for RedPajama v2 or FineWeb as their origi- based directly on the Mistral-7B architecture, with the
nal data already provides token count and other insights sole modifications being the vocabulary size, which we
into the dataset distribution. Figures 2 and 3 show the increase to 51,200 tokens, and the context length, which is
aggregation of word counts per domain for Italian and set to 4,096 tokens without activating the sliding window
English, respectively. attention feature. Hence, Minerva-7B is structured as a
decoder-only transformer model, comprising 32 layers.</p>
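      <p>The aggregation itself is straightforward; a plain-Python sketch of the computation (not the WIMBD implementation) is:</p>
      <preformat>
# Sketch: aggregating word counts per URL domain, analogous to our WIMBD-based
# analysis (a plain-Python equivalent, not the WIMBD implementation itself).
from collections import Counter
from urllib.parse import urlparse

def words_per_domain(documents) -> Counter:
    """documents: iterable of dicts with at least 'url' and 'text' fields."""
    counts = Counter()
    for doc in documents:
        domain = urlparse(doc["url"]).netloc
        counts[domain] += len(doc["text"].split())
    return counts

docs = [{"url": "https://it.wikipedia.org/wiki/Roma",
         "text": "Roma è la capitale d'Italia."}]
print(words_per_domain(docs).most_common(5))
      </preformat>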
    </sec>
    <sec id="sec-app-c">
      <title>C. Tokenizer</title>
      <p>We trained two tokenizers for Minerva. The first one is shared by the three smaller sizes, 350M, 1B and 3B. It is trained on a mix of 4GB of Italian text data and 4GB of English text data, both from CulturaX. Our objective is to have a balanced vocabulary across the two languages, mirroring the training data. We use the SentencePiece library (https://github.com/google/sentencepiece) to train a BPE tokenizer and we apply byte fallback. We set a vocabulary size of 32,768, a multiple of 8, as recommended for some GPU architectures. For the 7B tokenizer, we increase the vocabulary size to 51,200 to account for the inclusion of code data. We also train a BPE tokenizer (https://huggingface.co/docs/tokenizers/en/api/trainers) with 4GB of English text, 4GB of Italian text, and 1GB of code; the text data is sampled from the training mix of datasets for the 7B model, as reported in Table 1.</p>
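      <p>A sketch of the SentencePiece setup for the tokenizer of the smaller models; the file names are placeholders for the 4GB+4GB Italian/English sample from CulturaX:</p>
      <preformat>
# Sketch: training the 32,768-token BPE tokenizer with byte fallback.
# File names are placeholders for the Italian/English sample from CulturaX.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="culturax_it_sample.txt,culturax_en_sample.txt",
    model_prefix="minerva_tokenizer",
    model_type="bpe",
    vocab_size=32768,     # a multiple of 8, as recommended for some GPUs
    byte_fallback=True,   # unseen characters fall back to byte pieces
)
      </preformat>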
    </sec>
    <sec id="sec-app-d">
      <title>D. Model</title>
      <p>The Minerva LLM family consists of four models, each sharing the same underlying architecture, i.e., that of Mistral-7B. The models are differentiated by their size, ranging from the 350 million parameters of Minerva-350M to the 7 billion parameters of the largest model, Minerva-7B. The Minerva family also includes Minerva-1B and Minerva-3B, with 1 billion and 3 billion parameters, respectively. More specifically, the Minerva-7B model is based directly on the Mistral-7B architecture, with the sole modifications being the vocabulary size, which we increase to 51,200 tokens, and the context length, which is set to 4,096 tokens without activating the sliding window attention feature. Hence, Minerva-7B is structured as a decoder-only transformer model comprising 32 layers. Each layer includes 32 attention heads, where each key-value pair is shared among four queries. Additionally, the model features feed-forward layers with a hidden size of 4096 and an intermediate size of 14336, which is 3.5 times the hidden size. Minerva-3B is a scaled-down version of Minerva-7B, and it shares similar features with Mistral-7B, including a maximum context length of 16,384 tokens, sliding window attention spanning 2,048 tokens, and a vocabulary size of 32,768 tokens. To achieve approximately 3 billion parameters, we have reduced the hidden size to 2560 and the intermediate size to 8960. Minerva-1B and Minerva-350M differ from their larger counterparts in several key respects. Both models have 16 attention heads, in contrast to the higher count in the larger models. Additionally, the hidden and intermediate sizes of the feed-forward layers are reduced further: Minerva-1B features a hidden size of 2048 and an intermediate size of 7168, while Minerva-350M has a hidden size of 1152 and an intermediate size of 4032. The complete list of parameters is reported in Table 3.</p>
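      <p>Since Minerva-7B follows the Mistral architecture, its configuration can be expressed directly with the Hugging Face MistralConfig. A sketch using the values listed above (full attention is obtained by disabling the sliding window):</p>
      <preformat>
# Sketch: the Minerva-7B architecture expressed as a Hugging Face MistralConfig,
# using the values reported above; this mirrors the description, not our exact
# training configuration files.
from transformers import MistralConfig

minerva_7b_config = MistralConfig(
    vocab_size=51200,              # tokenizer with code data, cf. Appendix C
    hidden_size=4096,
    intermediate_size=14336,       # 3.5x the hidden size
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,         # GQA: one key-value pair every four queries
    max_position_embeddings=4096,  # context length
    sliding_window=None,           # no SWA: full attention across the context
)
      </preformat>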
    </sec>
    <sec id="sec-7">
      <title>E. News Summarization</title>
      <sec id="sec-7-1">
        <title>Additional results. Table 8 reports the full results of our evaluation on news summarization.</title>
        <p>docplayer.it
radioradicale.it</p>
        <p>tripadvisor.it
it.wikipedia.org
it.blastingnews.com
it.m.wikipedia.org
ilsussidiario.net</p>
        <p>treccani.it
sentenze.laleggepertutti.it
ilgiornale.it
slideplayer.it
it.scribd.com
247.libero.it
ilfattoquotidiano.it</p>
        <p>kijiji.it
issuu.com
repubblica.it
medicitalia.it</p>
        <p>corriere.it
it.paperblog.com</p>
        <p>lastampa.it
blitzquotidiano.it
huffingtonpost.it</p>
        <p>tg24.sky.it
L immobiliare.it
RU it.jooble.org</p>
        <p>it.topwar.ru
laleggepertutti.it</p>
        <p>ibs.it
ilsole24ore.com
slideshare.net
it.knowledgr.com
informazione.it
tuttosu.virgilio.it
spaziogames.it
it.notizie.yahoo.com
airbnb.it
caasa.it
blog.giallozafferano.it
icase.it
jetcost.it
amazon.it
affaritaliani.it
emagister.it
it.venere.com
ilfoglio.it
ansa.it
sport.sky.it
artribune.com
documenti.camera.it
0.000
class.8 The hyperparameters we used are reported in
that Minerva-350M and Minerva-1B were finetuned using</p>
      </sec>
      <sec id="sec-7-2">
        <title>AdamW optimizer [29]. Minerva-3B was trained using</title>
        <p>AdamW_Paged_32bit, a lighter version of AdamW, which
in [43]. We also tried out diferent combinations, but we
allows a larger batch size to be used during training.
noticed that the best evaluation scores are given by the</p>
      </sec>
      <sec id="sec-7-3">
        <title>8https://huggingface.co/docs/trl/en/sft_trainer</title>
        <p>google.com
issuu.com
scribd.com
tripadvisor.com
patents.google.com</p>
        <p>docplayer.net
en.wikipedia.org
tripadvisor.co.uk
slideshare.net
amazon.com
nytimes.com
dailymail.co.uk</p>
        <p>scout.com
stackoverflow.com</p>
        <p>archive.org
patentsencyclopedia.com
journals.plos.org
theguardian.com</p>
        <p>frontiersin.org
seekingalpha.com</p>
        <p>law.justia.com
ncbi.nlm.nih.gov
washingtonpost.com
slideplayer.com
L link.springer.com
RU nature.com
fanfiction.net
barnesandnoble.com
reddit.com
hindawi.com
patents.justia.com
amazon.co.uk
medium.com
finance.yahoo.com</p>
        <p>hubpages.com
s3-us-west-1.amazonaws.com
shameface.com.wstub.archive.org
casetext.com
thefreelibrary.com
bleacherreport.com
rightmove.co.uk
prweb.com
airbnb.co.uk
semanticscholar.org
articles.chicagotribune.com</p>
        <p>expedia.com
studymode.com
forums.macrumors.com</p>
        <p>telegraph.co.uk
publications.parliament.uk
0.000</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>F. Few-shot Machine Translation</title>
      <p>Here, we provide more details on our experimental setup
for the Machine Translation task. In our experiments, we
test the capability of a base model (i.e., with no
instruction fine-tuning or task-specific fine-tuning) to translate
a sentence from English to Italian and vice versa.
Previously, LLMs have been shown to perform well in machine
translation and they now rival task-specific MT systems
on a number of benchmarks [50] and tasks [51]. In our
case, we prompt the language models by providing a set
of 5 randomly sampled English-to-Italian translations
(and vice-versa for the Italian-to-English translation).
Finally, we measure the translation performance of the
models using COMET, a learned metric that assesses the quality of an automatic translation against a gold reference, as COMET has shown better correlation with human judgement than other metrics, such as BLEU.</p>
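      <p>A sketch of COMET scoring with the Unbabel comet package; the checkpoint name is a common public choice, not necessarily the exact model used in our evaluation:</p>
      <preformat>
# Sketch: scoring translations with COMET; the checkpoint below is a common
# public one and not necessarily the exact model used in our evaluation.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

data = [{
    "src": "How much does this book cost?",  # source sentence
    "mt": "Quanto costa questo libro?",      # system translation
    "ref": "Quanto costa questo libro?",     # gold reference
}]
scores = comet_model.predict(data, batch_size=8, gpus=0)
print(scores.system_score)  # corpus-level COMET score
      </preformat>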
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>tence?</source>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>07830</volume>
          (
          <year>2019</year>
          ). decoder,
          <source>arXiv preprint arXiv:2304.04052</source>
          (
          <year>2023</year>
          ). [35]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          , [46]
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cherry</surname>
          </string-name>
          , G. Foster, M. Krikun,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Representations (ICLR</surname>
          </string-name>
          ) (
          <year>2021</year>
          ). PMLR,
          <year>2023</year>
          , pp.
          <fpage>10867</fpage>
          -
          <lpage>10878</lpage>
          . [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , et al.,
          <source>Piqa</source>
          <volume>:</volume>
          [47]
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Wen-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>language, in: Proceedings of the AAAI confer- A. Fan, The flores-101 evaluation benchmark for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>ence on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp. low
          <article-title>-resource and multilingual machine translation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          7432-
          <fpage>7439</fpage>
          . Transactions of the Association for Computational [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <source>Crowdsourcing Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>522</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>multiple choice science questions</article-title>
          , arXiv preprint [48]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Titov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          , Improv-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>arXiv:1707.06209</source>
          (
          <year>2017</year>
          ).
          <source>ing massively multilingual neural machine trans</source>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>Truthfulqa: Measur- lation and zero-shot translation</article-title>
          , arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>ing how models mimic human falsehoods</article-title>
          , arXiv arXiv:
          <year>2004</year>
          .
          <volume>11867</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>preprint arXiv:2109.07958</source>
          (
          <year>2021</year>
          ). [49]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ross</surname>
          </string-name>
          , Biases in large lan[39]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          , Y. Choi, guage models: Origins, inventory, and discussion,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Winogrande</surname>
          </string-name>
          :
          <article-title>An adversarial winograd schema chal-</article-title>
          <source>J. Data and Information Quality</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          lenge at scale,
          <source>Communications of the ACM</source>
          <volume>64</volume>
          URL: https://doi.org/10.1145/3597307. doi:
          <volume>10</volume>
          .1145/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          (
          <year>2021</year>
          )
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          .
          <fpage>3597307</fpage>
          . [40]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Serino</surname>
          </string-name>
          , [50]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Minhas</surname>
          </string-name>
          , I. Ilyas,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2024. URL: https://arxiv.org/abs/2406.17535. in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.), Pro-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>arXiv:2406.17535. ceedings of the 2023 Conference on Empirical</source>
          [41]
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mihaylov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>Methods in Natural Language Processing</source>
          , Asso-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>ale</surname>
            , J. Du,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Pasunuru</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Shleifer</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          <string-name>
            <surname>Koura</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>1612</fpage>
          -
          <lpage>1634</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. O</given-names>
            <surname>'Horo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , L. Zettlemoyer, org/
          <year>2023</year>
          .emnlp-main.
          <volume>100</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Few-</surname>
          </string-name>
          emnlp-main.
          <volume>100</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>shot learning with multilingual generative lan-</article-title>
          [51]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. F.</given-names>
            <surname>Minhas</surname>
          </string-name>
          , S. Potdar,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022</source>
          Con
          <article-title>- with retrieval-augmented generation from multi-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>guage Processing</surname>
          </string-name>
          ,
          <source>Association for Computational 2024 Conference on Empirical Methods in Natu-</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>9019</fpage>
          -
          <lpage>9052</lpage>
          . URL: https://aclanthology. tional Linguistics, Miami, Florida, USA,
          <year>2024</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          org/
          <year>2022</year>
          .emnlp-main.
          <volume>616</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          . https://arxiv.org/abs/2410.14057.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          emnlp-main.
          <volume>616</volume>
          . [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Sekhavat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>guage Model Family with Open Training and In- Table 6 shows the source of each dataset used to train</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>ference Framework</source>
          , arXiv.org (
          <year>2024</year>
          ).
          <article-title>URL: https: Minerva in its diferent sizes</article-title>
          .
          <source>The Tokens column shows</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          //arxiv.org/abs/2404.14619v1.
          <article-title>the total number of tokens we used from each dataset</article-title>
          . [43]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Nissim, It5: Large-scale text-to-text Where Table 1 shows more tokens used for training, it</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>generation</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:2203.03759</source>
          (
          <year>2022</year>
          ).
          <article-title>reach that number</article-title>
          .
          <article-title>All these datasets are openly licensed</article-title>
          . [44]
          <string-name>
            <given-names>N.</given-names>
            <surname>Landro</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. La</given-names>
            <surname>Grassa</surname>
          </string-name>
          , E. Federici, Two
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>summarization, Information</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>228</fpage>
          . [45]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. M.-C. So</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Z</given-names>
          </string-name>
          . Liu,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>