Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data

Riccardo Orlando¹,†, Luca Moroni¹,†, Pere-Lluís Huguet Cabot¹,†, Edoardo Barba¹, Simone Conia¹, Sergio Orlandini², Giuseppe Fiameni³ and Roberto Navigli¹,∗

¹ Sapienza NLP Group, Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome, Italy
² CINECA, Bologna, Italy
³ NVIDIA, Santa Clara, California, USA

Abstract
The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model's vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva's development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.

Keywords
Large Language Models, Language Modeling, Italian Language, LLM Pretraining

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
orlando@diag.uniroma1.it (R. Orlando); moroni@diag.uniroma1.it (L. Moroni); huguetcabot@diag.uniroma1.it (P. H. Cabot); barba@diag.uniroma1.it (E. Barba); conia@diag.uniroma1.it (S. Conia); s.orlandini@cineca.it (S. Orlandini); gfiameni@nvidia.com (G. Fiameni); navigli@diag.uniroma1.it (R. Navigli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



1. Introduction

Large Language Models (LLMs) have revolutionized the way Natural Language Processing (NLP) tasks are approached, achieving remarkable results in existing areas and opening the door to entirely new research directions and applications. As a result, the energy and resources dedicated to the study and creation of LLMs are growing exponentially. However, most LLMs – both closed- and open-source – are predominantly designed for English, posing significant challenges and limitations for their use in non-English settings. In practice, generating Italian text using multilingual or language-adapted English models, e.g., from Mistral [1] or Llama [2, 3], is computationally more expensive and often less effective compared to using a model specifically designed for the Italian language. This inefficiency stems from the vocabulary of an English or multilingual LLM, i.e., the lexical units, or tokens, that the model can use to compose text, not being optimized for the Italian language, which results in Italian words being split into an excessive number of tokens. Consequently, this creates longer sequences of tokens, slower generation times, and higher computational costs, especially since many popular attention mechanisms have a quadratic complexity with respect to sequence length.

Efforts to create language-specific LLMs are increasing, and fall primarily into two main categories: i) adapting existing English-centric LLMs to other languages, and ii) training LLMs from scratch. The advantages of adapting existing English-centric LLMs to other languages are enticing: starting with a proven model can reduce the computational requirements, and adaptation can be achieved with relatively modest amounts of data. There are several language adaptation techniques, ranging from fine-tuning the model on data for the target language [4, 5] to modifying the model's architecture [6, 7, 8], making these techniques flexible for different budgets and objectives. However, these techniques may not fully capture language-specific nuances and can degrade performance in the original language, an undesirable side effect. Alternatively, training LLMs from scratch provides the freedom to make design choices tailored to the linguistic features of the target language (including morphology, lexicon, syntax, and semantics), which are often overlooked in English-centric models [9]. It also allows for incorporating culturally relevant content, reducing biases that might be present in models primarily trained on English data, thus leading to more inclusive and accurate representations of language use.




Unfortunately, while there are several efforts to adapt English-centric LLMs to the Italian language, e.g., Llamantino-2 [4], Llamantino-3 [5], DanteLLM [10], and Camoscio [11], inter alia, there is no truly open-source endeavor exploring what can be achieved by training an LLM from scratch on Italian data.

With this work, we follow the latter path and introduce Minerva, the first family of LLMs designed specifically for the Italian language and pretrained on Italian text.¹ We present the design choices for our models, our data processing, and the evaluation results for our Minerva LLMs, showing that our models – with 350M, 1B, 3B, and 7B parameters – outperform comparable multilingual models and even rival larger models adapted for Italian. We conclude with a discussion on the benefits and challenges of pretraining LLMs from scratch for the Italian language, sharing our experience and findings to provide valuable insights for the academic and industrial communities interested in training non-English LLMs from scratch. Lastly, we describe the technical details of Minerva-7B, our latest model with 7.4 billion parameters, for which we share our initial results.

¹ https://nlp.uniroma1.it/minerva

2. Building a Pretraining Dataset for Italian LLMs

The field of LLMs is growing at an astonishing pace, with new models, datasets, benchmarks, and techniques presented every week. However, over the past few months, academic and industrial researchers have increasingly recognized the fundamental role of the data used to pretrain LLMs. Unsurprisingly, the majority of the leading companies are not releasing their training data as they seek to maintain an advantage over the competition, with very few exceptions (e.g., OLMo by AllenAI [12] and OpenELM by Apple [13]). In this section, we describe the different sources of data used in the training of the Minerva models; Table 1 provides an overview of these sources (cf. Appendix A for more details). Most importantly, the training datasets we used are entirely available online, making our process transparent and allowing researchers to better study the connection between pretraining data and model behavior.

  Dataset                          Minerva – Model Size
  Name                 Lang.      350M    1B      3B       7B
  RedPajama-V2         Italian    –       –       –        894B
  CulturaX             Italian    35B     100B    330B     237B
  Wikipedia            Italian    –       –       –        1.3B
  Gutenberg            Italian    –       –       –        0.15B
  Wikisource           Italian    –       –       –        0.12B
  EurLex               Italian    –       –       –        1.6B
  Gazzetta Ufficiale   Italian    –       –       –        1.7B
  FineWeb              English    –       –       –      1,076B
  CulturaX             English    35B     100B    330B     –
  Wikipedia            English    –       –       –        5.3B
  ArXiv                English    –       –       –        33B
  Gutenberg            English    –       –       –        7B
  StackExchange        English    –       –       –        22B
  The Stack V2         Code       –       –       –        201B
  Total # of tokens               70B     200B    660B    2.48T

Table 1
Datasets used to train Minerva with their languages (second column) and number of tokens (third to sixth columns).

2.1. Data Sources

The training data for our Minerva models consists of three main categories: Italian, English, and code data. We only use the code data to train our largest model, i.e., Minerva-7B.

2.1.1. Italian Data

Web data. The majority of the text used to train LLMs is sourced from Web-scraped data, typically from CommonCrawl (CC). Therefore, a significant portion of the Italian text included in our training datasets is also of this nature, inherently exposing our models to the potential biases and toxic content commonly found on the Web. Because preprocessing techniques, such as language identification, perplexity filtering, deduplication, and content classification, are computationally expensive, the most sensible choice is to rely on preprocessed collections, such as CulturaX [14] and RedPajama v2 [15]. These collections already include Italian data, and have undergone various levels of filtering and deduplication, as discussed in Section 2.2.
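To give a concrete sense of how such preprocessed collections can be consumed, the following is a minimal sketch based on the Hugging Face datasets library. The dataset identifier follows the URLs listed in Table 6, while the streaming setup and field access are our own illustrative choices (CulturaX, for instance, may require accepting its terms of use on the Hub).

    from datasets import load_dataset

    # CulturaX exposes one configuration per language code ("it" for Italian);
    # streaming iterates over the corpus without downloading it in full.
    culturax_it = load_dataset("uonlp/CulturaX", "it",
                               split="train", streaming=True)

    for document in culturax_it.take(3):
        # each record carries the raw text plus metadata such as the source URL
        print(document["url"], document["text"][:200])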
Curated data. While Penedo et al. [16] suggest that high-quality Web data is sufficient on its own to train LLMs, curated data sources are often used to further improve model performance and introduce a broader diversity of data types, such as encyclopedic and academic text [17], as well as scientific and math-related text. Therefore, we include curated texts from several sources, including Wikipedia (encyclopedic/world knowledge data), EurLex and Gazzetta Ufficiale (law, economics, and politics), and the Gutenberg Project (novels, poetry, etc.).

2.1.2. English Data

Web data. Mirroring our approach with the Italian data, we use preprocessed collections of English data from the Web. Given that English is the most popular language on the Internet and has been the primary focus of LLM research, there are numerous options that already provide a large number of tokens from filtered, deduplicated, and cleaned sources. For our Minerva-350M, 1B, and 3B models, we collect data from the English partition of CulturaX, capping the number of tokens to the same amount as the Italian ones, as shown in Table 1. Instead, to train Minerva-7B, we use a portion of FineWeb [18], which includes filtered and deduplicated CC dumps with various timestamps. Specifically, we use the CC dumps from 2023-14 to 2024-18 to match the total number of tokens in the Italian Web partition of our training data.

Curated sources. We include the 5.3B tokens from the English Wikipedia and 7B tokens from the copyright-free books in Project Gutenberg. Additionally, we include data from arXiv and StackExchange, which are included in the RedPajama dataset.

2.1.3. Code Data

Previous work has highlighted the importance of including source code in the pretraining corpus of an LLM, in order to improve not only its code understanding and generation, but also its general reasoning capabilities [19], even for tasks that do not directly involve or require programming. Therefore, for our largest model – Minerva-7B – we also include a portion of code data. More specifically, we extract 200B tokens from The Stack V2 [20], selecting the data from its deduplicated partition, which includes 17 of the most popular programming languages on GitHub.

2.2. Data Preprocessing

As mentioned above, our preprocessing effort remains minimal, as we rely on the preprocessing pipelines used in CulturaX, RedPajama, and FineWeb. To evaluate the content and quality of our training data, we employ the methodology described in Elazar et al. [21] to analyze the URL domain distribution within the Italian partitions of CulturaX and RedPajama, as these partitions had never been utilized in training an LLM prior to Minerva. We provide an overview of our analysis together with a few insights in Appendix B.

2.3. Data Filtering and Deduplication

Previous work on English-centric LLMs [22] has already emphasized the importance of training LLMs on "clean" data. Two of the most important parts of data cleaning are filtering, i.e., removing content that does not satisfy a set of criteria, and deduplication, i.e., removing portions of text that appear too often, so as to minimize memorization.

As mentioned above, for the corpus used to train the Minerva models, we rely mainly on collections of data that have already been filtered and deduplicated. However, there are some minor considerations that depend on each collection of data. More specifically, we use CulturaX as-is, relying on its filtering and deduplication pipeline. Unfortunately, RedPajama v2 is not filtered and deduplicated; however, its data is tagged with meta-information that can be used to apply filtering and deduplication. Such metadata includes, for example, the perplexity score of each text computed via a language model trained on Wikipedia, which is used to partition RedPajama v2 into three partitions: head, middle, and tail. For our training corpus, we only include a document if it is classified as head or middle according to its perplexity score. Moreover, we use the precomputed metadata to remove exact duplicates and apply fuzzy deduplication. The latter is performed by using the hash provided for each document with Locality Sensitive Hashing and a Jaccard similarity of 0.7 to decide whether two documents are fuzzy duplicates. Note that we only apply fuzzy deduplication within each CC dump, rather than across all the dumps. This decision is motivated by two observations: first, applying fuzzy deduplication across all CC dumps is computationally expensive; second, previous work [18] has shown that per-dump deduplication is not only sufficient, but also beneficial, when training English LLMs.
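As an illustration of the fuzzy deduplication step described above, here is a minimal sketch using MinHash with Locality Sensitive Hashing and a Jaccard threshold of 0.7. The datasketch library and the character-shingling scheme are our own assumptions for the example; the actual pipeline relies on the hashes precomputed in RedPajama v2.

    from datasketch import MinHash, MinHashLSH

    def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
        # hash the set of character 5-grams of the document
        m = MinHash(num_perm=num_perm)
        for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
            m.update(shingle.encode("utf-8"))
        return m

    def fuzzy_dedup(docs: list[str]) -> list[str]:
        lsh = MinHashLSH(threshold=0.7, num_perm=128)  # Jaccard similarity 0.7
        kept = []
        for idx, text in enumerate(docs):
            m = doc_minhash(text)
            if not lsh.query(m):       # no near-duplicate kept so far
                lsh.insert(str(idx), m)
                kept.append(text)
        return kept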
3. Minerva LLMs

In this section, we provide an overview of the Minerva LLMs: we describe their tokenizers, the design choices behind the model architecture, and how we trained the resulting LLMs.

3.1. Vocabulary and Tokenizers

The vocabulary of an LLM is mainly impacted by its size, i.e., the number of tokens in the vocabulary itself, and how the tokenizer is trained, i.e., which tokens make up the vocabulary. These two factors impact the fertility of the resulting tokenizer, which measures the average number of tokens (subwords) into which a word is split. Tokenizers with lower fertility are preferable, as the input and output sequences they produce are shorter, resulting in an efficiency gain, especially as most attention mechanisms are quadratic with respect to the sequence length. Unsurprisingly, the vocabulary allocation of an English-centric LLM minimizes the fertility of English text, and results in high fertility values for Italian text, as shown in Table 2.

                                 Fertility (↓ – lower is better)
                                  CulturaX         Wikipedia
  Tokenizer        |Vocab|      Ita     Eng      Ita     Eng
  Mistral-7B        32,000      1.87    1.32     2.05    1.57
  Gemma-7B         256,000      1.42    1.18     1.56    1.34
  Minerva-350M      32,768      1.39    1.32     1.66    1.59
  Minerva-1B        32,768      1.39    1.32     1.66    1.59
  Minerva-3B        32,768      1.39    1.32     1.66    1.59
  Minerva-7B        51,200      1.32    1.26     1.56    1.51

Table 2
Fertility rates (lower is better) for Minerva tokenizers compared to other LLMs. The fertility rates are computed on a randomly sampled collection of texts from CulturaX and Wikipedia in both Italian (Ita) and English (Eng).

Given the importance for our Minerva LLMs of having a low fertility on Italian text, we intentionally train the Minerva tokenizer on a balanced mix of English and Italian data (and code data for the 7B model). Our analysis shows that this strategy leads to a much improved fertility on Italian data, while at the same time maintaining similar fertility on English data. More specifically, for Minerva-350M/1B/3B, we opted for a vocabulary size similar to that of Mistral-7B (around 32k tokens): in this case, the fertility of the Minerva tokenizer is ~20% better than that of the Mistral tokenizer on the Italian Wikipedia and only ~1% worse on the English Wikipedia. Following recent trends in LLMs, for Minerva-7B, we increased the vocabulary size to around 50k tokens, which resulted in a further fertility improvement of ~6% and ~5% on the Italian and English Wikipedias, respectively, notwithstanding the addition of code data to the training data. We provide more details on the tokenizer in Appendix C.
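As a worked example of the fertility measure used in Table 2, the sketch below computes it as the ratio between subword tokens and whitespace-separated words. The naive word splitting and the Hugging Face model identifier are illustrative assumptions, not the exact evaluation setup.

    from transformers import AutoTokenizer

    def fertility(texts: list[str], tokenizer) -> float:
        # average number of subword tokens per (whitespace-separated) word
        n_words = sum(len(t.split()) for t in texts)
        n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
        return n_tokens / n_words

    tok = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0")
    print(fertility(["Il gatto dorme tranquillamente sul divano."], tok))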

3.2. Model Architecture

While the field of LLMs is moving rapidly, one of the best models when our efforts started was Mistral. Our Minerva LLMs are therefore based on Mistral's model architecture: they are a family of decoder-only transformer models with a few standout features, such as grouped-query attention (GQA) [23], which boosts inference speed and reduces memory requirements for increased throughput, and sliding window attention (SWA) [24, 25], which manages longer sequences more efficiently at reduced computational costs. Specifically, the GQA is configured to share one key-value pair every four queries, while the SWA configuration handles up to 2,048 tokens with a maximum context length of 16,384 tokens. We build four models of different sizes by scaling the number of attention heads, hidden size, intermediate size, and hidden layers, while maintaining a ratio of ~3.5 between the intermediate size and the hidden size, as in the original Mistral model. However, following the more recent model releases by Mistral, Minerva-7B does not use SWA. Instead, it implements full attention across its entire context length, which can extend up to 4,096 tokens, i.e., double the 2,048-token SWA window used in Minerva-350M/1B/3B. The parameters for each model size are detailed in Table 3, for which we provide a more in-depth description in Appendix D.

Building Minerva on top of Mistral's model architecture also brings other benefits, such as broad compatibility with the ecosystem of libraries, frameworks, and tools that has emerged over recent months, including llama.cpp [26], FlashAttention [27], and vLLM [28].
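The configuration above maps directly onto a standard Mistral-style model definition. As a sketch, the Minerva-3B row of Table 3 can be expressed with the Hugging Face MistralConfig; this is an illustration of the hyperparameters, not the configuration file actually used for training.

    from transformers import MistralConfig

    minerva_3b_like = MistralConfig(
        vocab_size=32768,
        hidden_size=2560,
        intermediate_size=8960,         # ~3.5x the hidden size, as in Mistral
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=8,          # GQA: one KV pair every four queries
        sliding_window=2048,            # SWA window (disabled for Minerva-7B)
        max_position_embeddings=16384,  # maximum context length
    )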
3.3. Model Training

We train all the Minerva LLMs using MosaicML's LLM Foundry.² The training process is conducted on the Leonardo Supercomputer³ hosted and maintained by CINECA. Each node in Leonardo is equipped with 4 custom NVIDIA A100 SXM4 GPUs with 64GB of VRAM each.

All our models are trained using the AdamW optimizer [29] with β1 = 0.9, β2 = 0.95, and eps = 10^-8 (with the only exception being Minerva-7B, which is trained using eps = 10^-5) on a standard causal language modeling objective. To smooth the training process, we follow standard practice in the literature and employ a warmup-then-cooldown learning rate schedule. More specifically, we first increase the learning rate linearly during the initial training phase (2% of the total number of training steps for Minerva-350M/1B/3B and ~0.3% for Minerva-7B) until the peak learning rate is reached (2x10^-4 for Minerva-350M/1B/3B, 3x10^-4 for Minerva-7B), and then decrease the learning rate with a cosine schedule until the end of the training process. The hyperparameters used for each model are shown in Table 7.

² https://github.com/mosaicml/llm-foundry
³ https://leonardo-supercomputer.cineca.eu/
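For clarity, the schedule just described can be written down explicitly. The sketch below reimplements it with the Minerva-7B values (~0.3% linear warmup to a peak of 3e-4, then cosine decay); the actual runs used the scheduler shipped with LLM Foundry.

    import math

    def learning_rate(step: int, total_steps: int,
                      peak_lr: float = 3e-4, warmup_frac: float = 0.003) -> float:
        warmup_steps = max(1, int(total_steps * warmup_frac))
        if step < warmup_steps:
            return peak_lr * step / warmup_steps           # linear warmup
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay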
  Model                 Params   Layers        Hidden Size     Inter. Size      Att. Heads     KV Heads        SW Length        Ctx. Length
  Minerva-350M          352M       16             1152            4032              16              4             2048                  16,384
  Minerva-1B            1.01B      16             2048            7168              16              4             2048                  16,384
  Minerva-3B            2.89B      32             2560            8960              32              8             2048                  16,384
  Minerva-7B            7.40B      32             4096           14336              32              8             None                   4,096

Table 3
Overview of the main hyperparameters for our Minerva models. We include the number of parameters (approximately, 350M,
1B, 3B, and 7B) and the corresponding number of layers, hidden size, intermediate size, attention heads, key-value heads,
sliding window length, and maximum context length.


  Size    Name                      ARC-C         ARC-E      BoolQ    GSM8K        HS     MMLU          PIQA   SciQ      TQA     WG      AVG
  0.4B    Minerva-350M-base-v1.0        24.6       36.4       60.7       48.2      32.6      25.7       59.5    63.7     46.5    58.4     45.6
    1B    Minerva-1B-base-v1.0          26.6       42.2       57.1       49.7      39.6      27.0       62.9    73.5     44.6    60.0     48.3
   3B     OpenELM-3B                    27.0       37.9       60.9       49.7      40.7      28.3       56.7    81.8     47.3    58.4     48.9
   3B     XGLM-2.9B                     27.5       41.4       59.1       65.7      44.5      27.4       59.9    77.8     43.1    60.2     50.6
   3B     Minerva-3B-base-v1.0          31.4       49.1       62.1       55.8      52.9      29.2       66.9    79.9     41.4    62.2     53.1
   7B     OLMo-7B-0724-hf               30.7       44.0       72.9       52.5      47.9      30.9       58.7    85.1     44.6    61.2     52.8
   7B     LLaMAntino-2-7b               33.7       50.8       70.9       52.2      54.9      33.8       64.4    86.1     44.3    64.1     55.5
   7B     Minerva-7B-base-v1.0          42.0       68.8       79.5       50.0      62.6      36.2       69.8    87.7     38.5    65.0     60.0
   7B     Mistral-7B-v0.1               42.8       61.3       78.2       56.1      60.4      38.0       65.5    90.8     43.5    68.8     60.5
   8B     Llama-3.1-8B                  44.0       61.1       78.0       57.8      62.9      38.7       67.7    90.3     43.0    69.2     61.3

Table 4
Zero-shot evaluation results of the Minerva models on a set of standard benchmarks translated from English to Italian.



4. Evaluation

We measure the 0-shot performance of our Minerva LLMs on ITA-Bench [30], a suite of benchmarks that have been created either by translating existing benchmarks from other languages, or by adapting existing Italian benchmarks so that they can be used for LLM evaluation. ITA-Bench includes a set of 10 benchmarks commonly used to evaluate LLMs, namely, ARC Challenge (ARC-C), ARC Easy (ARC-E) [31], BoolQ [32], GSM8K [33], HellaSwag (HS) [34], MMLU [35], PIQA [36], SciQ [37], TruthfulQA [38], and Winogrande (WG) [39]. Overall, these benchmarks offer a comprehensive view of the capabilities of an LLM on a wide variety of aspects, including scientific knowledge, world knowledge (e.g., geography, politics, economics), commonsense knowledge, physical interactions, coreference, and math reasoning, among others. Employing automatically-translated benchmarks is far from ideal, but it allows us to better compare the scores obtained in Italian with those obtained in English while the Italian research community develops Italian-specific benchmarks [40].
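Zero-shot scoring of multiple-choice benchmarks is commonly implemented by comparing the log-likelihood that the model assigns to each candidate continuation. The sketch below illustrates this standard recipe; it is not necessarily the exact ITA-Bench protocol, and the model identifier is only an example.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0")
    model = AutoModelForCausalLM.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0")

    def option_logprob(question: str, option: str) -> float:
        # note: assumes the tokenization of the prefix is stable when the
        # option is appended, which holds for typical whitespace boundaries
        prompt_len = tok(question, return_tensors="pt").input_ids.shape[1]
        full_ids = tok(question + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        # sum the log-probabilities of the option tokens given the question
        return sum(logprobs[i, full_ids[0, i + 1]].item()
                   for i in range(prompt_len - 1, full_ids.shape[1] - 1))

    # the predicted answer is the option with the highest score,
    # often normalized by the number of option tokens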
As shown in Table 4, the average performance of the Minerva models increases steadily with the model size. For our 3B model, we also provide a comparison with two models of the same size: XGLM [41], a multilingual LLM by Meta, and OpenELM [42], a very recent English-only model developed by Apple. Our evaluation shows that Minerva-3B outperforms XGLM and OpenELM by a significant margin, i.e., +4.4% and +3.7% on average, respectively.

Finally, Minerva-7B achieves the highest performance among the Minerva LLMs family, as expected. Notably, Minerva-7B achieves a higher average score than Llamantino-2. This is an interesting comparison because the pretraining data for Llama-2, i.e., the pretrained LLM used to build Llamantino-2, is not available and has never been disclosed, making the model open-weights but not entirely open-source.⁴ When compared to LLMs whose pretraining data is likewise undisclosed, such as Mistral-7B-v0.1 or Llama-3.1-8B, Minerva still lags behind in some tasks, such as BoolQ or GSM8K, which may require better reasoning capabilities and/or more pretraining data. As we can observe from Figure 1, which tracks the progress of Minerva-7B on ITA-Bench every 10,000 training steps, the model is still slowly improving towards the end of the pretraining phase, suggesting that a larger training corpus or multiple epochs may be beneficial in future developments.

Figure 1: Tracking the progress of Minerva-7B during its pretraining process. Here, we report the average accuracy on ITA-Bench every 10,000 steps, i.e., every 40B tokens approximately.

⁴ We stress that, for Llamantino-2, only the data that has been used for the language adaptation process is available, whereas the pretraining data is not.

5. Downstream tasks

In this section, we show the results of the Minerva models when adapted to two downstream applications. This analysis is particularly relevant for Minerva-350M and Minerva-1B, which can be utilized for specific tasks rather than as general-purpose models, offering lower computational costs. The tasks in this analysis include: i) Italian Abstractive News Summarization, and ii) Machine Translation, in both directions (IT-EN and EN-IT).

News Summarization. Following Sarti and Nissim [43], we fine-tune Minerva models (up to 3B) on a concatenation of two Italian news summarization datasets, from the Fanpage.it and Il Post newspapers [44]. A detailed overview of the hyperparameters used to train our models is provided in Appendix E. We find that Minerva-3B obtains the best results (0.30 vs. 0.29 of the second best in terms of Rouge-L); however, it is not as parameter-efficient as IT5-Large, probably because encoder-decoder models are more suitable for fine-tuning than decoder-only models [45]. In Table 8, we report the full results of Minerva fine-tuned on the aforementioned datasets and compared to the baselines in Sarti and Nissim [43], which include mBART, mT5, and IT5.
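As a sketch of how a decoder-only model can be adapted to summarization, the snippet below concatenates the article and its summary and masks the loss on the prompt. The Italian prompt wording and the masking scheme are our own illustrative choices; the full hyperparameters are given in Appendix E.

    import torch

    def build_example(tokenizer, article: str, summary: str, max_len: int = 2048):
        prompt = f"Articolo: {article}\nRiassunto: "
        prompt_ids = tokenizer(prompt).input_ids
        target_ids = tokenizer(summary + tokenizer.eos_token,
                               add_special_tokens=False).input_ids
        input_ids = (prompt_ids + target_ids)[:max_len]
        # -100 masks the prompt so the loss is computed on the summary only
        labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]
        return {"input_ids": torch.tensor(input_ids),
                "labels": torch.tensor(labels)}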



Machine Translation. We also evaluate our Minerva LLMs in few-shot [46] machine translation on two benchmarks, FLORES [47] and OPUS-100 [48]. We explore how LLMs perform this task relying only on in-context-learning few-shot examples, reporting our results with 5-shot prompting. We rely on the vLLM library [28], changing the default decoding parameters to temperature=0 and max_tokens=512.

We highlight that Minerva-3B reaches competitive results in MT in both EN-IT (84.8 on FLORES and 76.7 on OPUS in terms of COMET score) and IT-EN (85.7 and 78.0). Compared with other models of similar size, Minerva-3B shows strong results when the target language is Italian (+1.7 and +2.7 compared to Gemma-2B and Qwen-1.5B on OPUS). Minerva-7B further showcases this by achieving the highest performance among the models tested when translating from English into Italian. The full results are reported in Table 5.

                                  FLORES              OPUS
  Model                 EN-IT ↑   IT-EN ↑   EN-IT ↑   IT-EN ↑
  Minerva-1B            66.37     73.72     57.40     64.61
  Minerva-3B            84.83     85.67     76.74     78.04
  Minerva-7B            87.02     87.20     79.07     79.91
  Gemma-2B              83.31     86.51     75.05     78.94
  Qwen-1.5B             80.18     86.16     74.01     78.95
  TinyLlama-1.1B-v1.1   73.40     83.62     65.72     75.44
  LLaMa-2-7B            85.24     87.47     77.30     80.36
  Mistral-7B            86.56     87.75     78.08     80.56
  Qwen-7B               86.00     87.66     78.50     81.21

Table 5
COMET scores measuring the translation capabilities of our Minerva models and other LLMs on the FLORES and OPUS datasets. This evaluation is conducted in a 5-shot setting, where each model receives five random translation examples from the development set before the test instance.
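A minimal sketch of this 5-shot setup with vLLM and COMET follows; the prompt template, the checkpoint names, and the placeholder development examples and test instance are illustrative assumptions, not the exact evaluation harness.

    from vllm import LLM, SamplingParams
    from comet import download_model, load_from_checkpoint

    llm = LLM(model="sapienzanlp/Minerva-3B-base-v1.0")
    params = SamplingParams(temperature=0, max_tokens=512)

    # placeholder data: in practice, five (source, target) pairs are sampled
    # from the development set, and the test instance comes from the test set
    dev_examples = [("The cat sleeps.", "Il gatto dorme.")]
    source_sentence, reference = "The book is on the table.", "Il libro è sul tavolo."

    shots = "\n".join(f"English: {en}\nItalian: {it}" for en, it in dev_examples)
    prompt = f"{shots}\nEnglish: {source_sentence}\nItalian:"
    translation = llm.generate([prompt], params)[0].outputs[0].text.strip()

    # reference-based COMET scoring
    comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    score = comet.predict([{"src": source_sentence, "mt": translation,
                            "ref": reference}], batch_size=8).system_score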
6. Conclusion and Future Work

In this paper, we demonstrated the feasibility and benefits of pretraining Italian language models from scratch, which not only improves the computational efficiency and performance of an LLM for a target language but also reduces linguistic biases inherited from English training corpora [49]. The Minerva models (https://nlp.uniroma1.it/minerva) showcase promising results on a variety of Italian benchmarks and downstream tasks, including news summarization and machine translation. Most importantly, we describe, for the first time, the process of creating an Italian pretraining corpus with more than 1T tokens, and we share findings and insights into the pretraining process of Italian LLMs with the academic and industrial communities, paving the way for future research in training non-English language models. We hope that our contributions will represent a stepping stone for future work on language-specific and multilingual large-scale language modeling.
Acknowledgments

Edoardo Barba, Simone Conia and Pere-Lluís Huguet Cabot are fully funded by the PNRR MUR project PE0000013-FAIR. Roberto Navigli acknowledges the support of the CREATIVE PRIN project. The authors acknowledge the CINECA award IsB28_medit under the ISCRA initiative for the availability of high-performance computing resources and support.
References

[1] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
[3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. URL: https://arxiv.org/abs/2307.09288. arXiv:2307.09288.
[4] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, 2023. URL: https://arxiv.org/abs/2312.09993. arXiv:2312.09993.
[5] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.
[6] M. Ostendorff, G. Rehm, Efficient language model training through cross-lingual and progressive transfer learning, arXiv preprint arXiv:2301.09626 (2023).
[7] K. Dobler, G. de Melo, FOCUS: Effective embedding initialization for monolingual specialization of multilingual models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13440–13454. URL: https://aclanthology.org/2023.emnlp-main.829. doi:10.18653/v1/2023.emnlp-main.829.
[8] Z. Csaki, B. Li, J. Li, Q. Xu, P. Pawakapan, L. Zhang, Y. Du, H. Zhao, C. Hu, U. Thakker, SambaLingo: Teaching large language models new languages, arXiv preprint arXiv:2404.05829 (2024).
[9] M. Faysse, P. Fernandes, N. Guerreiro, A. Loison, D. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, P. Martins, et al., CroissantLLM: A truly bilingual French-English language model, arXiv preprint arXiv:2402.00786 (2024).
[10] A. Bacciu, C. Campagnano, G. Trappolini, F. Silvestri, DanteLLM: Let's push Italian LLM research forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4343–4355. URL: https://aclanthology.org/2024.lrec-main.388.
[11] A. Santilli, E. Rodolà, Camoscio: An Italian instruction-tuned LLaMA, 2023. arXiv:2307.16456.
[12] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, E. Strubell, N. Subramani, M. Wortsman, P. Dasigi, N. Lambert, K. Richardson, L. Zettlemoyer, J. Dodge, K. Lo, L. Soldaini, N. A. Smith, H. Hajishirzi, OLMo: Accelerating the science of language models, 2024. URL: https://arxiv.org/abs/2402.00838. arXiv:2402.00838.
[13] S. Mehta, M. H. Sekhavat, Q. Cao, M. Horton, Y. Jin, C. Sun, I. Mirzadeh, M. Najibi, D. Belenko, P. Zatloukal, M. Rastegari, OpenELM: An efficient language model family with open training and inference framework, 2024. URL: https://arxiv.org/abs/2404.14619. arXiv:2404.14619.
[14] T. Nguyen, C. V. Nguyen, V. D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, T. H. Nguyen, CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages, 2023. arXiv:2309.09400.
[15] Together Computer, RedPajama: An open dataset for training large language models, 2023. URL: https://github.com/togethercomputer/RedPajama-Data.
[16] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, J. Launay, The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, arXiv preprint arXiv:2306.01116 (2023). URL: https://arxiv.org/abs/2306.01116. arXiv:2306.01116.
[17] M. Faysse, P. Fernandes, N. M. Guerreiro, A. Loison, D. M. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, P. H. Martins, A. B. Casademunt, F. Yvon, A. F. T. Martins, G. Viaud, C. Hudelot, P. Colombo, CroissantLLM: A truly bilingual French-English language model, 2024. URL: https://arxiv.org/abs/2402.00786. arXiv:2402.00786.
[18] G. Penedo, H. Kydlíček, L. B. Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, T. Wolf, The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL: https://arxiv.org/abs/2406.17557. arXiv:2406.17557.
[19] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. S. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. A. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, Y. Koreeda, Holistic evaluation of language models, Transactions on Machine Learning Research (2023). URL: https://openreview.net/forum?id=iO4LZibEqW.
[20] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, X. He, M. Dey, E. Abati, Y. Chai, N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, C. Xu, J. McAuley, H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jernite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, A. Guha, L. von Werra, H. de Vries, StarCoder 2 and The Stack v2: The next generation, 2024. arXiv:2402.19173.
[21] Y. Elazar, A. Bhagia, I. H. Magnusson, A. Ravichander, D. Schwenk, A. Suhr, E. P. Walsh, D. Groeneveld, L. Soldaini, S. Singh, H. Hajishirzi, N. A. Smith, J. Dodge, What's in my big data?, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=RvfPnOkPV4.
[22] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, J. Launay, The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, 2023. URL: https://arxiv.org/abs/2306.01116. arXiv:2306.01116.
[23] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, S. Sanghai, GQA: Training generalized multi-query transformer models from multi-head checkpoints, arXiv preprint arXiv:2305.13245 (2023).
[24] R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019).
[25] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[26] G. Gerganov, llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++. URL: https://github.com/ggerganov/llama.cpp.
[27] T. Dao, FlashAttention-2: Faster attention with better parallelism and work partitioning, in: International Conference on Learning Representations (ICLR), 2024.
[28] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with PagedAttention, in: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[29] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[30] L. Moroni, S. Conia, F. Martelli, R. Navigli, ITA-Bench: Towards a more comprehensive evaluation for Italian LLMs, in: CLiC-it, 2024.
[31] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, arXiv preprint arXiv:1803.05457 (2018).
[32] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, BoolQ: Exploring the surprising difficulty of natural yes/no questions, arXiv preprint arXiv:1905.10044 (2019).
[33] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021).
[34] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, arXiv preprint arXiv:1905.07830 (2019).
[35] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, Proceedings of the International Conference on Learning Representations (ICLR) (2021).
[36] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al., PIQA: Reasoning about physical commonsense in natural language, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 7432–7439.
[37] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science questions, arXiv preprint arXiv:1707.06209 (2017).
[38] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring how models mimic human falsehoods, arXiv preprint arXiv:2109.07958 (2021).
[39] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, WinoGrande: An adversarial Winograd Schema Challenge at scale, Communications of the ACM 64 (2021) 99–106.
[40] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino, A. Seveso, Disce aut deficere: Evaluating LLMs proficiency on the INVALSI Italian benchmark, 2024. URL: https://arxiv.org/abs/2406.17535. arXiv:2406.17535.
[41] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary, B. O'Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. Diab, V. Stoyanov, X. Li, Few-shot learning with multilingual generative language models, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 9019–9052. URL: https://aclanthology.org/2022.emnlp-main.616. doi:10.18653/v1/2022.emnlp-main.616.
[42] S. Mehta, M. H. Sekhavat, Q. Cao, M. Horton, Y. Jin, C. Sun, I. Mirzadeh, M. Najibi, D. Belenko, P. Zatloukal, M. Rastegari, OpenELM: An efficient language model family with open training and inference framework, arXiv.org (2024). URL: https://arxiv.org/abs/2404.14619v1.
[43] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022).
[44] N. Landro, I. Gallo, R. La Grassa, E. Federici, Two new datasets for Italian-language abstractive text summarization, Information 13 (2022) 228.
[45] Z. Fu, W. Lam, Q. Yu, A. M.-C. So, S. Hu, Z. Liu, N. Collier, Decoder-only or encoder-decoder? Interpreting language model as a regularized encoder-decoder, arXiv preprint arXiv:2304.04052 (2023).
[46] X. Garcia, Y. Bansal, C. Cherry, G. Foster, M. Krikun, M. Johnson, O. Firat, The unreasonable effectiveness of few-shot learning for machine translation, in: International Conference on Machine Learning, PMLR, 2023, pp. 10867–10878.
[47] N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, A. Fan, The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation, Transactions of the Association for Computational Linguistics 10 (2022) 522–538.
[48] B. Zhang, P. Williams, I. Titov, R. Sennrich, Improving massively multilingual neural machine translation and zero-shot translation, arXiv preprint arXiv:2004.11867 (2020).
[49] R. Navigli, S. Conia, B. Ross, Biases in large language models: Origins, inventory, and discussion, Journal of Data and Information Quality 15 (2023) 1–21. URL: https://doi.org/10.1145/3597307. doi:10.1145/3597307.
[50] S. Conia, M. Li, D. Lee, U. Minhas, I. Ilyas, Y. Li, Increasing coverage and precision of textual information in multilingual knowledge graphs, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 1612–1634. URL: https://aclanthology.org/2023.emnlp-main.100. doi:10.18653/v1/2023.emnlp-main.100.
[51] S. Conia, D. Lee, M. Li, U. F. Minhas, S. Potdar, Y. Li, Towards cross-cultural machine translation with retrieval-augmented generation from multilingual knowledge graphs, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024. URL: https://arxiv.org/abs/2410.14057.

A. Data sources

Table 6 shows the source of each dataset used to train Minerva in its different sizes. The Tokens column shows the total number of tokens we used from each dataset. Where Table 1 shows more tokens used for training, it means they were resampled from the total in order to reach that number. All these datasets are openly licensed.
     Dataset                Tokens     Language        Genre                                       URL
     RedPajama-Data-V2         688B        Italian       Web        https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
     CulturaX                  158B        Italian       Web                     https://huggingface.co/datasets/uonlp/CulturaX
     Wikipedia                  1.3B       Italian   Encyclopedic             https://huggingface.co/datasets/wikimedia/wikipedia
     Gutenberg                0.15B        Italian      Books               https://huggingface.co/datasets/manu/project_gutenberg
     Wikisource               0.12B        Italian      Books                 https://huggingface.co/datasets/wikimedia/wikisource
     EurLex                     1.6B       Italian       Law                 https://huggingface.co/datasets/joelito/eurlex_resources
     Gazzetta Ufficiale         1.7B       Italian       Law                https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale
     FineWeb                 1,076B       English        Web               https://huggingface.co/datasets/HuggingFaceFW/fineweb
     CulturaX                  330B       English        Web                     https://huggingface.co/datasets/uonlp/CulturaX
     Wikipedia                  5.3B      English    Encyclopedic             https://huggingface.co/datasets/wikimedia/wikipedia
     ArXiv                      33B       English     Academic      https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
     Gutenberg                    7B      English       Books               https://huggingface.co/datasets/manu/project_gutenberg
     StackExchange              22B       English       Forum       https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
     The Stack V2              201B          Code        Code        https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids

Table 6
Detailed breakdown of each pretraining dataset: token count, language, genre, and source URL.
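
As a pointer for reproduction, each entry in Table 6 can be loaded from the Hugging Face Hub. Below is a minimal sketch for streaming the Italian portion of CulturaX; it assumes the datasets library is installed, that any gating on the dataset has been accepted, and that records expose text and url fields as in CulturaX's published schema.

# Sketch: stream the Italian subset of CulturaX without downloading it fully.
# Assumes the `datasets` library; gated datasets may require `huggingface-cli login`.
from datasets import load_dataset

# "it" selects the Italian subset; streaming avoids materializing ~158B tokens on disk.
culturax_it = load_dataset("uonlp/CulturaX", "it", split="train", streaming=True)

for i, record in enumerate(culturax_it):
    # Each record carries its source URL and raw text (per the dataset card).
    print(record["url"], record["text"][:80])
    if i == 2:
        break
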



B. Dataset Insights

We leveraged the WIMBD5 library to compute word counts per URL domain on CulturaX. We did not do this for RedPajama v2 or FineWeb, as their original releases already provide token counts and other insights into the dataset distribution. Figures 2 and 3 show the aggregation of word counts per domain for Italian and English, respectively.
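
For readers who want to reproduce this aggregation without the WIMBD library itself, the following minimal sketch computes the same per-domain statistic with the Python standard library; the inline records and their fields are illustrative stand-ins for actual corpus entries.

# Sketch: aggregate word counts per URL domain, as done for Figures 2 and 3.
# WIMBD-free reimplementation of the idea; the records below are illustrative.
from collections import Counter
from urllib.parse import urlparse

records = [
    {"url": "https://it.wikipedia.org/wiki/Roma", "text": "Rome is the capital of Italy."},
    {"url": "https://www.corriere.it/articolo", "text": "The latest news from around the world."},
]

words_per_domain = Counter()
total_words = 0
for rec in records:
    n_words = len(rec["text"].split())  # whitespace word count, a rough proxy
    words_per_domain[urlparse(rec["url"]).netloc] += n_words
    total_words += n_words

# Report each domain's share of the total (cf. the x-axis of Figures 2 and 3).
for domain, count in words_per_domain.most_common(50):
    print(f"{domain}\t{count / total_words:.4%}")
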

C. Tokenizer

We trained two tokenizers for Minerva. The first one is shared by the three smaller sizes, 350M, 1B and 3B. It is trained on a mix of 4GB of Italian text data and 4GB of English text data, both from CulturaX. Our objective is to have a balanced vocabulary across the two languages, mirroring the training data. We use the SentencePiece library6 to train a BPE tokenizer and we apply byte fallback. We set the vocabulary size to 32,768, a multiple of 8, as recommended for efficiency on some GPU architectures.
   For the 7B tokenizer, we increase the vocabulary size to 51,200 to account for the inclusion of code data. This tokenizer7 is likewise BPE-based and is trained on 4GB of English text, 4GB of Italian text, and 1GB of code. The text data is sampled from the training mix of datasets for the 7B model, as reported in Table 1.
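
As an illustration of this procedure, the following sketch trains such a tokenizer with the SentencePiece Python API; the input file names are placeholders for the 4GB Italian and 4GB English corpora, and the character coverage value is an assumed common default rather than a setting reported here.

# Sketch: train a 32,768-entry BPE tokenizer with byte fallback, as for the
# 350M/1B/3B models. Input paths are placeholders for the 4GB+4GB text mix.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="italian_4gb.txt,english_4gb.txt",  # comma-separated corpus files
    model_prefix="minerva_tokenizer",
    vocab_size=32768,           # multiple of 8, as discussed above
    model_type="bpe",
    byte_fallback=True,         # unknown characters decompose into raw bytes
    character_coverage=0.9995,  # assumption: a common default for non-CJK text
)
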

D. Model

The Minerva LLM family consists of four models, each sharing the same underlying architecture, i.e., that of Mistral-7B. The models are differentiated by their size, ranging from the 350 million parameters of Minerva-350M to the 7 billion parameters of the largest model, Minerva-7B. The Minerva family also includes Minerva-1B and Minerva-3B, with 1 billion and 3 billion parameters, respectively. More specifically, the Minerva-7B model is based directly on the Mistral-7B architecture, with the sole modifications being the vocabulary size, which we increase to 51,200 tokens, and the context length, which is set to 4,096 tokens without activating the sliding window attention feature. Hence, Minerva-7B is structured as a decoder-only transformer model comprising 32 layers. Each layer includes 32 attention heads, where each key-value pair is shared among four queries. Additionally, the model features feed-forward layers with a hidden size of 4096 and an intermediate size of 14336, which is 3.5 times the hidden size. Minerva-3B is a scaled-down version of Minerva-7B, and it shares similar features with Mistral-7B, including a maximum context length of 16,384 tokens, sliding window attention spanning 2,048 tokens, and a vocabulary size of 32,768 tokens. To achieve approximately 3 billion parameters, we reduced the hidden size to 2560 and the intermediate size to 8960. Minerva-1B and Minerva-350M differ from their larger counterparts in several key respects. Both models have 16 attention heads, in contrast to the higher count in the larger models. Additionally, the hidden and intermediate sizes of the feed-forward layers are reduced further: Minerva-1B features a hidden size of 2048 and an intermediate size of 7168, while Minerva-350M has a hidden size of 1152 and an intermediate size of 4032. The complete list of parameters is reported in Table 3.
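
Because Minerva-7B follows the Mistral architecture, the shape described above can be written down as a standard Hugging Face MistralConfig. The sketch below simply encodes the hyperparameters of this section; it is an illustration, not the authors' actual configuration file.

# Sketch: Minerva-7B's architecture expressed as a Hugging Face MistralConfig.
# Values come from the description above; illustrative, not the released config.
from transformers import MistralConfig

minerva_7b_config = MistralConfig(
    vocab_size=51200,              # enlarged vocabulary (text + code)
    hidden_size=4096,
    intermediate_size=14336,       # 3.5x the hidden size
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,         # grouped-query attention: 4 queries per KV head
    max_position_embeddings=4096,  # context length
    sliding_window=None,           # sliding window attention disabled
)
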

E. News Summarization

Additional results. Table 8 reports the full results of our evaluation on news summarization.
5 https://github.com/allenai/wimbd
6 https://github.com/google/sentencepiece
7 https://huggingface.co/docs/tokenizers/en/api/trainers
Figure 2: Domain word count distribution for Italian CulturaX. [Bar chart of the top-50 source URLs by percentage of the total 117B Italian words; the most represented domains include docplayer.it, radioradicale.it, tripadvisor.it, and it.wikipedia.org.]


Model          Optimizer   lr          betas         eps      Weight Decay   Scheduler   Warm-up   Batch Size   Steps
Minerva-350M   AdamW       2 × 10^-4   (0.9, 0.95)   10^-8    0.0            Cosine      2%        4M           16,690
Minerva-1B     AdamW       2 × 10^-4   (0.9, 0.95)   10^-8    0.0            Cosine      2%        4M           47,684
Minerva-3B     AdamW       2 × 10^-4   (0.9, 0.95)   10^-8    0.0            Cosine      2%        4M           157,357
Minerva-7B     AdamW       3 × 10^-4   (0.9, 0.95)   10^-5    0.1            Cosine      2,000     4M           591,558

Table 7
Training configuration for the Minerva models. Warm-up is expressed as a fraction of the total steps, except for Minerva-7B, where it is an absolute number of steps.
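
To make the optimizer settings in Table 7 concrete, the following PyTorch sketch reproduces the Minerva-350M row; the placeholder model and the warm-up arithmetic (2% of 16,690 steps, roughly 334 steps) are ours, not the authors' training code.

# Sketch: AdamW + cosine schedule with 2% warm-up, matching the Minerva-350M
# row of Table 7. The model here is a stand-in for the actual network.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(1024, 1024)  # placeholder for the real LLM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.0,
)

total_steps = 16_690
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.02 * total_steps),  # 2% warm-up, about 334 steps
    num_training_steps=total_steps,
)
# Inside the training loop: optimizer.step() followed by scheduler.step().
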



Additional details on the experimental setup. To fine-tune our Minerva models we relied on the SFTTrainer class.8 The hyperparameters we used are reported in Table 9. We sought to be in line with the decisions taken in [43]. We also tried different combinations, but we observed that the best evaluation scores were obtained with the reported parameters. Furthermore, we want to highlight that Minerva-350M and Minerva-1B were fine-tuned using the AdamW optimizer [29], while Minerva-3B was trained using AdamW_Paged_32bit, a lighter, paged variant of AdamW that allows a larger batch size to be used during training.

8 https://huggingface.co/docs/trl/en/sft_trainer
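
A minimal sketch of such a fine-tuning run is shown below, assuming a recent version of the trl library; the toy dataset and model identifier are placeholders, while the hyperparameters mirror Table 9, including the paged AdamW variant used for Minerva-3B.

# Sketch: fine-tuning with TRL's SFTTrainer using the Table 9 hyperparameters.
# The dataset and model id are placeholders; a recent `trl` release is assumed.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy stand-in for a real summarization corpus, formatted as plain text.
train_dataset = Dataset.from_dict(
    {"text": ["Article: ... Summary: ...", "Article: ... Summary: ..."]}
)

config = SFTConfig(
    output_dir="minerva-3b-sft",
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.2,
    weight_decay=5e-3,
    num_train_epochs=7,
    optim="paged_adamw_32bit",  # the paged AdamW variant used for Minerva-3B
)

trainer = SFTTrainer(
    model="sapienzanlp/Minerva-3B-base-v1.0",  # assumption: illustrative checkpoint id
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
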
Figure 3: Domain word count distribution for English CulturaX. [Bar chart of the top-50 source URLs by percentage of the total 2096B English words; the most represented domains include google.com, issuu.com, scribd.com, and tripadvisor.com.]



F. Few-shot Machine Translation

Here, we provide more details on our experimental setup for the Machine Translation task. In our experiments, we test the capability of a base model (i.e., with no instruction fine-tuning or task-specific fine-tuning) to translate a sentence from English to Italian and vice versa. Previously, LLMs have been shown to perform well in machine translation, and they now rival task-specific MT systems on a number of benchmarks [50] and tasks [51]. In our case, we prompt the language models with a set of 5 randomly sampled English-to-Italian translations (and vice versa for the Italian-to-English direction). Finally, we measure the translation performance of the models using COMET, a learned metric that assesses the quality of an automatic translation against a gold reference; we choose COMET because it has shown better correlation with human judgement than other metrics, such as BLEU.
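
Since the exact prompt template is not reported here, the following sketch shows one plausible way to assemble the 5-shot English-to-Italian prompt; the demonstration pairs and template wording are illustrative.

# Sketch: build a 5-shot English-to-Italian translation prompt. The template
# wording is illustrative; the paper does not report the exact format used.
import random

examples = [
    ("The weather is nice today.", "Oggi il tempo è bello."),
    ("I am reading a book.", "Sto leggendo un libro."),
    ("Where is the train station?", "Dov'è la stazione dei treni?"),
    ("She works in Rome.", "Lei lavora a Roma."),
    ("We will arrive tomorrow.", "Arriveremo domani."),
    ("The cat sleeps on the sofa.", "Il gatto dorme sul divano."),
]

def build_prompt(source_sentence: str, pool, k: int = 5) -> str:
    shots = random.sample(pool, k)  # 5 randomly sampled demonstrations
    lines = [f"English: {en}\nItalian: {it}\n" for en, it in shots]
    lines.append(f"English: {source_sentence}\nItalian:")
    return "\n".join(lines)

print(build_prompt("The museum opens at nine.", examples))
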
          Model            R1 ↑    R2 ↑   RL ↑
          mBART Large      0.32    0.15    0.25
          mT5 Small        0.34    0.16    0.26
          mT5 Base         0.33    0.16    0.26
          IT5 Small        0.35    0.17    0.28
          IT5 EL32         0.34    0.16    0.26
          IT5 Base         0.25    0.10    0.20
          IT5 Large        0.38    0.19    0.29
          Minerva-350M     0.35    0.17    0.27
          Minerva-1B       0.35    0.17    0.27
          Minerva-3B       0.39    0.20    0.30

Table 8
ROUGE scores (R1, R2, RL) of the models fine-tuned on news summarization.




Parameter        Value
warmup ratio     0.2
weight decay     5 × 10^-3
batch size       64
optimizer        AdamW | PagedAdamW 32bit (only 3B)
learning rate    5 × 10^-4
scheduler        Linear
epochs           7

Table 9
Hyper-parameters used to fine-tune our models.