<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Orlando</string-name>
          <email>orlando@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <email>moroni@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pere-Lluís Huguet Cabot</string-name>
          <email>huguetcabot@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Barba</string-name>
          <email>barba@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Conia</string-name>
          <email>conia@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Orlandini</string-name>
          <email>s.orlandini@cineca.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Fiameni</string-name>
          <email>gfiameni@nvidia.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <email>navigli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CINECA</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Large Language Models</institution>
          ,
          <addr-line>Language Modeling, Italian Language, LLM Pretraining</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>NVIDIA</institution>
          ,
          <addr-line>Santa Clara, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Sapienza NLP Group, Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model's vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva's development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Language Modeling</kwd>
        <kwd>Italian Language</kwd>
        <kwd>LLM Pretraining</kwd>
      </kwd-group>
      <conference>
        <conf-name>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR</p>
      <p>ceur-ws.org
existing English-centric LLMs to other languages, and</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>Large Language Models (LLMs) have revolutionized the</title>
        <p>way Natural Language Processing (NLP) tasks are
approached, achieving remarkable results in existing areas
and opening the door to entirely new research directions
and applications. As a result, the energy and resources
ing Italian text using multilingual or language-adapted</p>
      </sec>
      <sec id="sec-2-2">
        <title>English models, e.g., from Mistral [1] or Llama [2, 3], is</title>
        <p>computationally more expensive and often less efective
compared to using a model specifically designed for the</p>
      </sec>
      <sec id="sec-2-3">
        <title>Italian language. This ineficiency stems from the vocab</title>
        <p>units, or tokens, that the model can use to compose text
– when it is not optimized for the Italian language,
resulting in Italian words being split into an excessive number
of tokens. Consequently, this creates longer sequences
of tokens, slower generation times, and higher
computational costs, especially since many popular attention
mechanisms have a quadratic complexity with respect to</p>
        <p>Eforts to create language-specific LLMs are increasing,
ing existing English-centric LLMs to other languages
are enticing: starting with a proven model can reduce
the computational requirements, and adaptation can be
achieved with relatively modest amounts of data. There
are several language adaptation techniques, which range
guage [4, 5] to modifying the model’s architecture [6, 7, 8],
making these techniques flexible for diferent budgets
and objectives. However, these techniques may not fully
capture language-specific nuances and can degrade the
performance in the original language, indeed an
undesirable efect. Alternatively, training LLMs from scratch
provides the freedom to make design choices tailored
to the linguistic features of the target
language—including morphology, lexicon, syntax, and semantics—which
also allows for incorporating culturally relevant
content, reducing biases that might be present in models
primarily trained on English data, thus leading to more
inclusive and accurate representations of language use.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Unfortunately, while there are several eforts on adapt</title>
        <p>ing English-centric LLMs to the Italian language, e.g.,</p>
      </sec>
      <sec id="sec-2-5">
        <title>Llamantino-2 [4], Llamantino-3 [5], DanteLLM [10], and</title>
      </sec>
      <sec id="sec-2-6">
        <title>Camoscio [11], inter alia, there is no truly open-source</title>
        <p>endeavor exploring what can be achieved by training an</p>
      </sec>
      <sec id="sec-2-7">
        <title>LLM from scratch on Italian data.</title>
      </sec>
      <sec id="sec-2-8">
        <title>With this work, we follow the latter path and introduce</title>
      </sec>
      <sec id="sec-2-9">
        <title>Minerva, the first family of LLMs designed specifically</title>
        <p>for the Italian language and pretrained on Italian text.1</p>
      </sec>
      <sec id="sec-2-10">
        <title>We present the design choices for our models, our data</title>
        <p>processing, and the evaluation results regarding our
Minerva LLMs, showing that our models – with 350M, 1B,</p>
      </sec>
      <sec id="sec-2-11">
        <title>3B, and 7B parameters – outperform comparable multi</title>
        <p>lingual models and even rival larger models adapted for</p>
      </sec>
      <sec id="sec-2-12">
        <title>Italian. We conclude with a discussion on the benefits</title>
        <p>and challenges of pretraining LLMs from scratch for the</p>
      </sec>
      <sec id="sec-2-13">
        <title>Italian language, sharing our experience and findings to</title>
        <p>provide valuable insights for the academic and industrial
communities interested in training non-English LLMs
from scratch. Lastly, we describe the technical details of</p>
      </sec>
      <sec id="sec-2-14">
        <title>Minerva-7B, our latest model with 7.4 billion parameters, for which we share our initial results.</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Building a Pretraining Dataset for Italian LLMs</title>
      <p>The field of LLMs is growing at an astonishing pace, with
new models, datasets, benchmarks, and techniques
presented every week. However, over the past few months,
academic and industrial researchers have increasingly
recognized the fundamental role of the data used to
pretrain LLMs. Unsurprisingly, the majority of the leading
companies are not releasing their training data as they
seek to maintain an advantage over the competition, with
very few exceptions (e.g., OLMo by AllenAI [12] and
OpenELM by Apple [13]). In this section, we describe the
different sources of data used in the training of the
Minerva models, and Table 1 provides an overview of these
(cf. Appendix A for more details). Most importantly, the
training datasets we used are entirely available online,
making our process transparent and allowing researchers
to better study the connection between pretraining data
and model behavior.</p>
      <sec id="sec-3-1">
        <title>2.1. Data Sources</title>
        <p>The training data for our Minerva models consists of three main categories: Italian, English, and code data.</p>
        <p>[Table 1: Overview of the data used to train the Minerva models. Italian: RedPajama-V2, CulturaX, Wikipedia, Gutenberg, Wikisource, EurLex, Gazzetta Ufficiale. English: FineWeb, CulturaX, Wikipedia, ArXiv, Gutenberg, StackExchange. Code: The Stack V2. The number of tokens drawn from each source is reported in the original table.]</p>
        <sec id="sec-3-1-1">
          <title>2.1.1. Italian Data</title>
          <p>Web data. The majority of the text used to train LLMs is sourced from Web-scraped data, typically from CommonCrawl (CC). Therefore, a significant portion of the Italian text included in our training datasets is also of this nature, inherently exposing our models to the potential biases and toxic content commonly found on the Web. Because preprocessing techniques, such as language identification, perplexity filtering, deduplication, and content classification, are computationally expensive, the most sensible choice is to rely on preprocessed collections, such as CulturaX [14] and RedPajama v2 [15]. These collections already include Italian data, and have undergone various levels of filtering and deduplication, as discussed in Section 2.3.</p>
          <p>Curated data. While Penedo et al. [16] suggest that high-quality Web data is sufficient on its own to train LLMs, curated data sources are often used to further improve model performance and introduce a broader diversity of data types, such as encyclopedic and academic text [17], as well as scientific and math-related text. Therefore, we include curated texts from several sources, including Wikipedia (encyclopedic/world knowledge data), EurLex and Gazzetta Ufficiale (law, economics, and politics), and the Gutenberg Project (novels, poetry, etc.).</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>2.1.2. English Data</title>
          <p>Web data. Mirroring our approach with the Italian data, we use preprocessed collections of English data from the Web. Given that English is the most popular language on the Internet and has been the primary focus of LLM research, there are numerous options that already provide a large amount of tokens from filtered, deduplicated, and cleaned sources. For our Minerva-350M, 1B, and 3B models, we collect data from the English partition of CulturaX, capping the number of tokens to the same amount as the Italian ones, as shown in Table 1. Instead, to train Minerva-7B, we use a portion of FineWeb [18], which includes filtered and deduplicated CC dumps with various timestamps. Specifically, we use the CC dumps from 2023-14 to 2024-18 to match the total number of tokens in the Italian Web partition of our training data.</p>
          <p>Curated sources. We include the 5.3B tokens from the English Wikipedia and 7B tokens from the copyright-free books in Project Gutenberg. Additionally, we include data from arXiv and StackExchange, as provided in the RedPajama dataset.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>2.1.3. Code Data</title>
          <p>Previous work has highlighted the importance of including source code in the pretraining corpus of an LLM, in order to improve not only its code understanding and generation, but also its general reasoning capabilities [19], even for tasks that do not directly involve or require programming. Therefore, for our largest model – Minerva-7B – we also include a portion of code data. More specifically, we extract 200B tokens from The Stack V2 [20], selecting the data from their deduplicated partition, which includes 17 of the most popular programming languages on GitHub.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Data Preprocessing</title>
        <sec id="sec-3-2-1">
          <title>As mentioned above, our preprocessing efort remains</title>
          <p>minimal, as we rely on the preprocessing pipelines used
in CulturaX, RedPajama, and FineWeb. To evaluate the
content and quality of our training data, we employ the
methodology described in Elazar et al. [21] to analyze the</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>URL domain distribution within the Italian partition of</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>CulturaX and RedPajama, as these partitions had never been utilized in training an LLM prior to Minerva. We provide an overview of our analysis together with a few insights in Appendix B.</title>
        </sec>
        <sec id="sec-3-2-4">
          <title>The vocabulary of an LLM is mainly impacted by its size,</title>
          <p>i.e., the number of tokens in the vocabulary itself, and
how the tokenizer is trained, i.e., which tokens make up
the vocabulary. These two factors impact the fertility
of the resulting tokenizer, which measures the average
number of tokens (subwords) into which a word is split.</p>
        </sec>
        <sec id="sec-3-2-5">
          <title>Tokenizers with lower fertility are preferable, as the input</title>
          <p>and output sequences they produce are shorter,
result2.3. Data Filtering and Deduplication ing in an eficiency gain, especially as most attention
Previous work on English-centric LLMs [22] has already mechanisms are quadratic with respect to the sequence
emphasized the importance of training LLMs on “clean” length. Unsurprisingly, the vocabulary allocation of an
data. Two of the most important parts of data cleaning English-centric LLM minimizes the fertility of English
are filtering, i.e., removing content that does not satisfy a text, and results in high fertility values for Italian text, as
set of criteria, and deduplication, i.e., removing portions shown in Table 2.</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>In this section, we provide an overview of the Minerva</title>
        </sec>
        <sec id="sec-3-2-7">
          <title>LLMs: we describe their tokenizers, the design choices behind the model architecture, and how we trained the resulting LLMs.</title>
        </sec>
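        <p>As an illustration of this step, the sketch below applies MinHash-based Locality Sensitive Hashing with a Jaccard threshold of 0.7 within a single CC dump. It uses the datasketch library and recomputes MinHash signatures from word shingles, whereas our actual pipeline reuses the hashes already provided in the RedPajama v2 metadata:</p>
        <preformat>
# Sketch: fuzzy deduplication within one CC dump via MinHash + LSH (Jaccard 0.7).
# Assumption: MinHash signatures are recomputed from word 5-grams; the real
# pipeline reuses the precomputed document hashes shipped with RedPajama v2.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(len(words) - 4, 1)):  # word 5-gram shingles
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def dedup_dump(documents: dict) -> list:
    """Return the ids of documents kept after fuzzy deduplication."""
    lsh = MinHashLSH(threshold=0.7, num_perm=128)  # Jaccard similarity 0.7
    kept = []
    for doc_id, text in documents.items():
        m = minhash_of(text)
        if lsh.query(m):  # a near-duplicate has already been kept
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
        </preformat>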
      </sec>
      <sec id="sec-3-3">
        <title>3.1. Vocabulary and Tokenizers</title>
        <p>Tokenizer
Mistral-7B
Gemma-7B
Minerva-350M
Minerva-1B
Minerva-3B
Minerva-7B
|Vocab|
the more recent model releases by Mistral, Minerva-7B
does not use SWA. Instead, it implements full attention
across its entire context length, which can extend up to
4096 tokens, i.e., double the number of tokens for the</p>
        <sec id="sec-3-3-1">
          <title>SWA used in Minerva-350M/1B/3B. The parameters for</title>
          <p>each model size are detailed in Table 3, for which we
provide a more in-depth description in Appendix D.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Building Minerva on top of Mistral’s model architecture also brings other benefits, such as broad compatibility with the ecosystem of libraries, frameworks, and</title>
        </sec>
      </sec>
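        <p>Concretely, fertility can be estimated by tokenizing a sample of text and dividing the number of tokens by the number of whitespace-separated words. A minimal sketch, where the model identifiers are only illustrative:</p>
        <preformat>
# Sketch: estimating tokenizer fertility (average tokens per word) on a corpus.
# The model identifiers below are illustrative; any Hugging Face tokenizer works.
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, texts: list) -> float:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = ["La Divina Commedia è un poema di Dante Alighieri."]
print(fertility("mistralai/Mistral-7B-v0.1", texts))          # English-centric vocabulary
print(fertility("sapienzanlp/Minerva-3B-base-v1.0", texts))   # balanced It/En vocabulary
        </preformat>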
      <sec id="sec-3-4">
        <title>3.3. Model Training</title>
        <sec id="sec-3-4-1">
          <title>We train all the Minerva LLMs using MosaicML’s LLM</title>
          <p>Given the importance for our Minerva LLMs of hav- Foundry.2 The training process is conducted on the
ing a low fertility on Italian text, we intentionally train Leonardo Supercomputer3 hosted and maintained by
the Minerva tokenizer on a balanced mix of English and CINECA. Each node in Leonardo is equipped with 4 ×
Italian data (and code data for the 7B model). Our anal- custom NVIDIA A100 SXM4 with 64GB of VRAM.
ysis shows that this strategy leads to a much improved All our models are trained using the AdamW
optifertility on Italian data, while at the same time maintain- mizer [29] with  1 = 0.9,  2 = 0.95,  = 10 −8 (with the
ing similar fertility on English data. More specifically, only exception being Minerva-7B, which is trained
usfor Minerva-350M/1B/3B, we opted for a vocabulary size ing  = 10 −5) on a standard causal language modeling
similar to that of Mistral-7B (around 32k tokens): in this training objective. To smooth the training process, we
case, the fertility of the Minerva tokenizer is ~20% better follow standard practice in the literature and employ a
than the Mistral tokenizer on the Italian Wikipedia and warmup-then-cooldown learning rate scheduling. More
only ~1% worse on the English Wikipedia. Following specifically, we first increase the learning rate linearly
recent trends in LLMs, for Minerva-7B, we increased the during the initial training phase (2% of the total
numvocabulary size to around 50k tokens, which resulted in ber of training steps for Minerva-350M/1B/3B and 0̃.3%
a further fertility improvement of ~6% and ~5% on the for Minerva-7B) until the peak learning rate is reached
Italian and English Wikipedias, respectively, notwith- (2×10−4 for Minerva-350M/1B/3B, 3×10−4 for
Minervastanding the addition of code data to the training data. 7B), and then decrease the learning rate with a cosine
We provide more details on the tokenizer in Appendix C. scheduling until the end of the training process. The
hyperparameters used for each model are shown in Table 7.</p>
        </sec>
      </sec>
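        <p>For reference, this warmup-then-cosine schedule can be written as a function of the training step. A sketch with the Minerva-350M/1B/3B settings (2% linear warmup to a peak of 2×10⁻⁴); the final learning rate fraction is an assumption for illustration:</p>
        <preformat>
# Sketch: linear-warmup + cosine-decay learning rate schedule, with the peak LR
# and warmup fraction used for Minerva-350M/1B/3B; the final LR fraction is an
# assumption, not a value reported in the paper.
import math

def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 2e-4, warmup_frac: float = 0.02,
                  final_frac: float = 0.1) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:  # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
        </preformat>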
      <sec id="sec-3-5">
        <title>3.2. Model Architecture</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>We measure the 0-shot performance of our Minerva LLMs on ITA-Bench [30], a suite of benchmarks that have been created either by translating existing benchmarks from other languages, or by adapting existing Italian benchmarks so that they can be used for LLM evaluation. ITA-Bench includes a set of 10 benchmarks commonly used to evaluate LLMs, namely, ARC Challenge (ARC-C), ARC Easy (ARC-E) [31], BoolQ [32], GSM8K [33], HellaSwag (HS) [34], MMLU [35], PIQA [36], SciQ [37], TruthfulQA [38], and Winogrande (WG) [39]. Overall, these benchmarks offer a comprehensive view of the capabilities of an LLM on a wide variety of aspects, including scientific knowledge, world knowledge (e.g., geography, politics, economics), commonsense knowledge, physical interactions, coreference, and math reasoning, among others. Employing automatically-translated benchmarks is far from ideal, but it allows us to better compare the scores obtained in Italian with those obtained in English while the Italian research community develops Italian-specific benchmarks [40].</p>
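      <p>As an illustration of the 0-shot protocol, a common way to score a multiple-choice benchmark with a base LLM is to compare the log-likelihood the model assigns to each candidate answer. A sketch of this approach, not the exact ITA-Bench implementation:</p>
      <preformat>
# Sketch: 0-shot multiple-choice scoring by answer log-likelihood.
# This illustrates the general protocol, not the exact ITA-Bench implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapienzanlp/Minerva-3B-base-v1.0"  # illustrative identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

@torch.no_grad()
def answer_logprob(question: str, answer: str) -> float:
    # Assumes the prompt tokenization is a prefix of the full tokenization.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    logits = model(full_ids).logits.log_softmax(dim=-1)
    score = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        score += logits[0, pos - 1, token_id].item()  # logit at pos-1 predicts pos
    return score

def predict(question: str, options: list) -> str:
    return max(options, key=lambda option: answer_logprob(question, option))
      </preformat>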
      <p>As shown in Table 4, the average performance of the Minerva models increases steadily with the model size. For our 3B model, we also provide a comparison with two models of the same size: XGLM [41], a multilingual LLM by META, and OpenELM [42], a very recent English-only model developed by Apple. Our evaluation shows that Minerva-3B outperforms XGLM and OpenELM by a significant margin, i.e., +4.4% and +3.7% on average.</p>
      <p>Finally, Minerva-7B achieves the highest performance among the Minerva LLMs family, as expected. Notably, Minerva-7B achieves a higher average score than Llamantino-2. This is an interesting comparison because the pretraining data for Llama-2, i.e., the pretrained LLM used to build Llamantino-2, is not available and has never been disclosed, making the model open-weights but not entirely open-source (we stress that, for Llamantino-2, only the data used for the language adaptation process is available, whereas the pretraining data is not). When compared to closed-source LLMs such as Mistral-7B-v0.1 or Llama-3.1-8B, Minerva still lags behind in some tasks, such as BoolQ or GSM8K, which may require better reasoning capabilities and/or more pretraining data. As we can observe from Figure 1, which tracks the progress of Minerva-7B on ITA-Bench every 10,000 training steps, the model is still slowly improving towards the end of the pretraining phase, suggesting that a larger training corpus or multiple epochs may be beneficial in future developments.</p>
      <p>[Figure 1: Progress of Minerva-7B-base-v1.0 over time: average accuracy on ITA-Bench.]</p>
    </sec>
    <sec id="sec-5-downstream">
      <title>5. Downstream tasks</title>
      <p>In this section, we show the results of the Minerva models when adapted to two downstream applications. This analysis is particularly relevant for Minerva-350M and Minerva-1B, which can be utilized for specific tasks rather than as general-purpose models, offering lower computational costs. The tasks in this analysis include: i) Italian Abstractive News Summarization, and ii) Machine Translation, in both directions (IT-EN and EN-IT).</p>
      <p>News Summarization. Following Sarti and Nissim [43], we fine-tune Minerva models (up to 3B) on a concatenation of two Italian news summarization datasets: Fanpage.it and Il Post newspapers [44]. A detailed overview of the hyperparameters used to train our models is provided in Appendix E. We find that Minerva-3B obtains the best results (0.30 vs 0.29 of the second best in terms of Rouge-L); however, it is not as parameter-efficient as IT5-Large, probably because encoder-decoder models are more suitable for fine-tuning than decoder-only models [45]. In Table 8, we report the full results of Minerva fine-tuned on the aforementioned datasets and compared to the baselines in Sarti and Nissim [43], which include mBART, mT5, and IT5.</p>
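      <p>For reference, Rouge scores of the kind reported in Table 8 can be computed with the Hugging Face evaluate package. A minimal sketch with made-up example strings, not our exact evaluation script:</p>
      <preformat>
# Sketch: computing Rouge-L for generated summaries with the `evaluate` package.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Il governo approva la nuova legge di bilancio."]          # model outputs
references = ["Approvata dal governo la legge di bilancio per il 2024."]  # gold summaries
scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])  # on this scale, Minerva-3B reaches 0.30 (see Table 8)
      </preformat>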
      <sec id="sec-4-4">
        <title>Machine Translation. We also evaluate our Minerva</title>
        <p>LLMs in few-shot [46] machine translation on two
benchmarks, FLORES [47] and OPUS-100 [48]. We explore
how LLMs perform this task relying only on
in-contextlearning few-shot examples, reporting our results with</p>
      </sec>
      <sec id="sec-4-5">
        <title>5-shot prompting. We rely on the vLLM library [28] and change the default parameters with temperature=0 and max_tokens=512.</title>
      </sec>
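      <p>A sketch of this setup with vLLM; the prompt template and model identifier are illustrative, while temperature=0 and max_tokens=512 match the settings described above:</p>
      <preformat>
# Sketch: 5-shot EN-IT translation with vLLM; the prompt template is illustrative,
# while temperature=0 and max_tokens=512 match the settings described above.
from vllm import LLM, SamplingParams

llm = LLM(model="sapienzanlp/Minerva-3B-base-v1.0")  # illustrative identifier
params = SamplingParams(temperature=0, max_tokens=512)

few_shot = [  # 5 randomly sampled EN-IT pairs in our setup (2 shown here)
    ("The weather is nice today.", "Oggi il tempo è bello."),
    ("Where is the train station?", "Dov'è la stazione dei treni?"),
]
prompt = "".join(f"English: {en}\nItalian: {it}\n\n" for en, it in few_shot)
prompt += "English: How much does this book cost?\nItalian:"

output = llm.generate([prompt], params)[0]
print(output.outputs[0].text.strip())
      </preformat>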
      <sec id="sec-4-6">
        <title>We highlight that Minerva-3B reaches competitive re</title>
        <p>sults in MT in both EN-IT (84.8 on Flores and 76.7 on</p>
      </sec>
      <sec id="sec-4-7">
        <title>Opus in terms of COMET score) and IT-EN (85.7 and 78.0).</title>
        <p>Compared with other models of similar size, Minerva-3B
shows strong results when the target language is Italian
(+1.7 and +2.7 compared to Gemma-2B and Qwen-1.5B
on Opus). Minerva-7B further showcases this by
achieving the highest performance among models tested when
translating from English into Italian. The full results are
reported in Table 5.
minerva) showcase promising results on a variety of
Italian benchmarks and downstream tasks, including news
summarization and machine translation. Most
importantly, we describe, for the first time, the process of
cre6. Conclusion and Future Work ating an Italian pretraining corpus with more than 1T
tokens, and we share findings and insights into the
preIn this paper, we demonstrated the feasibility and bene- training process of Italian LLMs with the academic and
ifts of pretraining Italian language models from scratch, industrial communities, paving the way for future
rewhich not only improves the computational eficiency search in training non-English language models. We
and performance of an LLM for a target language but re- hope that our contributions will represent a stepping
duce linguistic biases inherited from English training cor- stone for future work on language-specific and
multilinpora [49]. The Minerva models (https://nlp.uniroma1.it/ gual large-scale language modeling.
Model
Minerva-1B
Minerva-3B
Minerva-7B
Gemma-2B
Qwen-1.5B
TinyLlama-1.1B-v1.1
LLaMa-2-7B
Mistral-7B
Qwen-7B</p>
        <p>FLORES</p>
        <p>OPUS
EN-IT ↑</p>
        <p>IT-EN ↑</p>
        <p>EN-IT ↑ IT-EN ↑
66.37
84.83
87.02
83.31
80.18
73.40
85.24
86.56
86.00
73.72
85.67
87.20
86.51
86.16
83.62
87.47
87.75
87.66
57.40
76.74
79.07
75.05
74.01
65.72
77.30
78.08
78.50
64.61
78.04
79.91
78.94
78.95
75.44
80.36
80.56
81.21</p>
      </sec>
      <sec id="sec-4-8">
        <title>Edoardo Barba, Simone Conia and Pere-Lluís Huguet</title>
      </sec>
      <sec id="sec-4-9">
        <title>Cabot are fully funded by the PNRR MUR project</title>
      </sec>
      <sec id="sec-4-10">
        <title>PE0000013-FAIR. Roberto Navigli acknowledges the sup</title>
        <p>port of the CREATIVE PRIN project. The authors
acknowledge the CINECA award IsB28_medit under the</p>
      </sec>
      <sec id="sec-4-11">
        <title>ISCRA initiative for the availability of high-performance</title>
        <p>computing resources and support.
//github.com/togethercomputer/RedPajama-Data. Smith, J. Dodge, What’s in my big data?, in: The
[16] G. Penedo, Q. Malartic, D. Hesslow, R. Cojo- Twelfth International Conference on Learning
Repcaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Al- resentations, 2024. URL: https://openreview.net/
mazrouei, J. Launay, The RefinedWeb dataset forum?id=RvfPnOkPV4.
for Falcon LLM: outperforming curated corpora [22] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru,
with web data, and web data only, arXiv preprint A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei,
arXiv:2306.01116 (2023). URL: https://arxiv.org/abs/ J. Launay, The refinedweb dataset for falcon llm:
2306.01116. arXiv:2306.01116. Outperforming curated corpora with web data, and
[17] M. Faysse, P. Fernandes, N. M. Guerreiro, A. Loison, web data only, 2023. URL: https://arxiv.org/abs/
D. M. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, 2306.01116. arXiv:2306.01116.</p>
        <p>P. H. Martins, A. B. Casademunt, F. Yvon, A. F. T. [23] J. Ainslie, J. Lee-Thorp, M. de Jong, Y.
ZemlyanMartins, G. Viaud, C. Hudelot, P. Colombo, Crois- skiy, F. Lebrón, S. Sanghai, Gqa: Training
generalsantllm: A truly bilingual french-english language ized multi-query transformer models from
multimodel, 2024. URL: https://arxiv.org/abs/2402.00786. head checkpoints, arXiv preprint arXiv:2305.13245
arXiv:2402.00786. (2023).
[18] G. Penedo, H. Kydlíček, L. B. allal, A. Lozhkov, [24] R. Child, S. Gray, A. Radford, I. Sutskever,
GeneratM. Mitchell, C. Rafel, L. V. Werra, T. Wolf, The ing long sequences with sparse transformers, arXiv
ifneweb datasets: Decanting the web for the finest preprint arXiv:1904.10509 (2019).
text data at scale, 2024. URL: https://arxiv.org/abs/ [25] I. Beltagy, M. E. Peters, A. Cohan, Longformer:
2406.17557. arXiv:2406.17557. The long-document transformer, arXiv preprint
[19] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, arXiv:2004.05150 (2020).</p>
        <p>
          M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Ku- [26] G. Gerganov, llama.cpp: Inference of meta’s llama
mar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. model (and others) in pure c/c++, ???? URL: https:
Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, //github.com/ggerganov/llama.cpp.
D. A. Hudson, E. Zelikman, E. Durmus, F. Lad- [27] T. Dao, FlashAttention-2: Faster attention with
hak, F. Rong, H. Ren, H. Yao, J. WANG, K. San- better parallelism and work partitioning, in:
Interthanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suz- national Conference on Learning Representations
gun, N. Kim, N. Guha, N. S. Chatterji, O. Khat- (ICLR), 2024.
tab, P. Henderson, Q. Huang, R. A. Chi, S. M. Xie, [28] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H.
S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Eficient
memT. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, ory management for large language model serving
Y. Zhang, Y. Koreeda, Holistic evaluation of lan- with pagedattention, in: Proceedings of the ACM
guage models, Transactions on Machine Learn- SIGOPS 29th Symposium on Operating Systems
ing Research (2023). URL: https://openre
          <xref ref-type="bibr" rid="ref16">view.net/ Principles, 2023</xref>
          .
forum?id=iO4LZibEqW, featured Certification, Ex- [29] I. Loshchilov, F. Hutter, Decoupled weight decay
pert Certification. regularization, arXiv preprint arXiv:1711.05101
[20] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy- (2017).
        </p>
        <p>Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, [30] L. Moroni, S. Conia, F. Martelli, R. Navigli,
ITAT. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Bench: Towards a more comprehensive evaluation
Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. for Italian LLMs, in: CLiC-it, 2024.
Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozh- [31] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A.
Sabskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, harwal, C. Schoenick, O. Tafjord, Think you have
X. He, M. Dey, E. Abati, Y. Chai, N. Muennighof, solved question answering? try arc, the ai2
reaX. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, soning challenge, arXiv preprint arXiv:1803.05457
M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, (2018).</p>
        <p>O. Dehaene, N. Patry, C. Xu, J. McAuley, H. Hu, [32] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski,
T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, M. Collins, K. Toutanova, Boolq: Exploring the
N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jer- surprising dificulty of natural yes/no questions,
nite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, arXiv preprint arXiv:1905.10044 (2019).
A. Guha, L. von Werra, H. de Vries, Starcoder [33] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen,
2 and the stack v2: The next generation, 2024. H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton,
arXiv:2402.19173. R. Nakano, et al., Training verifiers to solve math
[21] Y. Elazar, A. Bhagia, I. H. Magnusson, A. Ravichan- word problems, arXiv preprint arXiv:2110.14168
der, D. Schwenk, A. Suhr, E. P. Walsh, D. Groen- (2021).
eveld, L. Soldaini, S. Singh, H. Hajishirzi, N. A. [34] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi,
RedPajama-Data-V2
CulturaX
Wikipedia
Gutenberg
Wikisource
EurLex
Gazzetta Uficiale
FineWeb
CulturaX
Wikipedia
ArXiv
Gutenberg
StackExchange
The Stack V2</p>
        <p>Tokens</p>
        <p>Language
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2</p>
        <p>https://huggingface.co/datasets/uonlp/CulturaX
https://huggingface.co/datasets/wikimedia/wikipedia
https://huggingface.co/datasets/manu/project_gutenberg
https://huggingface.co/datasets/wikimedia/wikisource
https://huggingface.co/datasets/joelito/eurlex_resources
https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale
https://huggingface.co/datasets/HuggingFaceFW/fineweb</p>
        <p>https://huggingface.co/datasets/uonlp/CulturaX
https://huggingface.co/datasets/wikimedia/wikipedia
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T</p>
        <p>https://huggingface.co/datasets/manu/project_gutenberg
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>D. Model</title>
    </sec>
    <sec id="sec-6">
      <title>B. Dataset Insights</title>
      <p>to 7 billion parameters of the largest model,
Minerva</p>
      <sec id="sec-6-1">
        <title>7B. The Minerva family also includes Minerva-1B and</title>
        <p>We leveraged the WIMBD5 library to compute word Minerva-3B, with 1 billion and 3 billion parameters,
recounts per URL domain on CulturaX. We decided not spectively. More specifically, the Minerva-7B model is
to do this for RedPajama v2 or FineWeb as their origi- based directly on the Mistral-7B architecture, with the
nal data already provides token count and other insights sole modifications being the vocabulary size, which we
into the dataset distribution. Figures 2 and 3 show the increase to 51,200 tokens, and the context length, which is
aggregation of word counts per domain for Italian and set to 4,096 tokens without activating the sliding window
English, respectively. attention feature. Hence, Minerva-7B is structured as a
decoder-only transformer model, comprising 32 layers.</p>
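      <p>The aggregation itself is straightforward; a plain-Python sketch of the computation (not the WIMBD implementation) is:</p>
      <preformat>
# Sketch: aggregating word counts per URL domain, analogous to our WIMBD-based
# analysis (a plain-Python equivalent, not the WIMBD implementation itself).
from collections import Counter
from urllib.parse import urlparse

def words_per_domain(documents) -> Counter:
    """documents: iterable of dicts with at least 'url' and 'text' fields."""
    counts = Counter()
    for doc in documents:
        domain = urlparse(doc["url"]).netloc
        counts[domain] += len(doc["text"].split())
    return counts

docs = [{"url": "https://it.wikipedia.org/wiki/Roma",
         "text": "Roma è la capitale d'Italia."}]
print(words_per_domain(docs).most_common(5))
      </preformat>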
    </sec>
    <sec id="sec-app-c">
      <title>C. Tokenizer</title>
      <p>We trained two tokenizers for Minerva. The first one is shared by the three smaller sizes, 350M, 1B and 3B. It is trained on a mix of 4GB of Italian text data and 4GB of English text data, both from CulturaX. Our objective is to have a balanced vocabulary across the two languages, mirroring the training data. We use the SentencePiece library (https://github.com/google/sentencepiece) to train a BPE tokenizer and we apply byte fallback. We set a vocabulary size of 32,768, a multiple of 8, as recommended for some GPU architectures. For the 7B tokenizer, we increase the vocabulary size to 51,200 to account for the inclusion of code data. We also train a BPE tokenizer (https://huggingface.co/docs/tokenizers/en/api/trainers) with 4GB of English text, 4GB of Italian text, and 1GB of code; the text data is sampled from the training mix of datasets for the 7B model, as reported in Table 1.</p>
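      <p>A sketch of the SentencePiece setup for the tokenizer of the smaller models; the file names are placeholders for the 4GB+4GB Italian/English sample from CulturaX:</p>
      <preformat>
# Sketch: training the 32,768-token BPE tokenizer with byte fallback.
# File names are placeholders for the Italian/English sample from CulturaX.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="culturax_it_sample.txt,culturax_en_sample.txt",
    model_prefix="minerva_tokenizer",
    model_type="bpe",
    vocab_size=32768,     # a multiple of 8, as recommended for some GPUs
    byte_fallback=True,   # unseen characters fall back to byte pieces
)
      </preformat>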
    </sec>
    <sec id="sec-app-d">
      <title>D. Model</title>
      <p>The Minerva LLM family consists of four models, each sharing the same underlying architecture, i.e., that of Mistral-7B. The models are differentiated by their size, ranging from the 350 million parameters of Minerva-350M to the 7 billion parameters of the largest model, Minerva-7B. The Minerva family also includes Minerva-1B and Minerva-3B, with 1 billion and 3 billion parameters, respectively. More specifically, the Minerva-7B model is based directly on the Mistral-7B architecture, with the sole modifications being the vocabulary size, which we increase to 51,200 tokens, and the context length, which is set to 4,096 tokens without activating the sliding window attention feature. Hence, Minerva-7B is structured as a decoder-only transformer model comprising 32 layers. Each layer includes 32 attention heads, where each key-value pair is shared among four queries. Additionally, the model features feed-forward layers with a hidden size of 4096 and an intermediate size of 14336, which is 3.5 times the hidden size. Minerva-3B is a scaled-down version of Minerva-7B, and it shares similar features with Mistral-7B, including a maximum context length of 16,384 tokens, sliding window attention spanning 2,048 tokens, and a vocabulary size of 32,768 tokens. To achieve approximately 3 billion parameters, we have reduced the hidden size to 2560 and the intermediate size to 8960. Minerva-1B and Minerva-350M differ from their larger counterparts in several key respects. Both models have 16 attention heads, in contrast to the higher count in the larger models. Additionally, the hidden and intermediate sizes of the feed-forward layers are reduced further: Minerva-1B features a hidden size of 2048 and an intermediate size of 7168, while Minerva-350M has a hidden size of 1152 and an intermediate size of 4032. The complete list of parameters is reported in Table 3.</p>
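      <p>Since Minerva-7B follows the Mistral architecture, its configuration can be expressed directly with the Hugging Face MistralConfig. A sketch using the values listed above (full attention is obtained by disabling the sliding window):</p>
      <preformat>
# Sketch: the Minerva-7B architecture expressed as a Hugging Face MistralConfig,
# using the values reported above; this mirrors the description, not our exact
# training configuration files.
from transformers import MistralConfig

minerva_7b_config = MistralConfig(
    vocab_size=51200,              # tokenizer with code data, cf. Appendix C
    hidden_size=4096,
    intermediate_size=14336,       # 3.5x the hidden size
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,         # GQA: one key-value pair every four queries
    max_position_embeddings=4096,  # context length
    sliding_window=None,           # no SWA: full attention across the context
)
      </preformat>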
    </sec>
    <sec id="sec-7">
      <title>E. News Summarization</title>
      <sec id="sec-7-1">
        <title>Additional results. Table 8 reports the full results of our evaluation on news summarization.</title>
        <p>docplayer.it
radioradicale.it</p>
        <p>tripadvisor.it
it.wikipedia.org
it.blastingnews.com
it.m.wikipedia.org
ilsussidiario.net</p>
        <p>treccani.it
sentenze.laleggepertutti.it
ilgiornale.it
slideplayer.it
it.scribd.com
247.libero.it
ilfattoquotidiano.it</p>
        <p>kijiji.it
issuu.com
repubblica.it
medicitalia.it</p>
        <p>corriere.it
it.paperblog.com</p>
        <p>lastampa.it
blitzquotidiano.it
huffingtonpost.it</p>
        <p>tg24.sky.it
L immobiliare.it
RU it.jooble.org</p>
        <p>it.topwar.ru
laleggepertutti.it</p>
        <p>ibs.it
ilsole24ore.com
slideshare.net
it.knowledgr.com
informazione.it
tuttosu.virgilio.it
spaziogames.it
it.notizie.yahoo.com
airbnb.it
caasa.it
blog.giallozafferano.it
icase.it
jetcost.it
amazon.it
affaritaliani.it
emagister.it
it.venere.com
ilfoglio.it
ansa.it
sport.sky.it
artribune.com
documenti.camera.it
0.000
class.8 The hyperparameters we used are reported in
that Minerva-350M and Minerva-1B were finetuned using</p>
      </sec>
      <sec id="sec-7-2">
        <title>AdamW optimizer [29]. Minerva-3B was trained using</title>
        <p>AdamW_Paged_32bit, a lighter version of AdamW, which
in [43]. We also tried out diferent combinations, but we
allows a larger batch size to be used during training.
noticed that the best evaluation scores are given by the</p>
      </sec>
      <sec id="sec-7-3">
        <title>8https://huggingface.co/docs/trl/en/sft_trainer</title>
        <p>google.com
issuu.com
scribd.com
tripadvisor.com
patents.google.com</p>
        <p>docplayer.net
en.wikipedia.org
tripadvisor.co.uk
slideshare.net
amazon.com
nytimes.com
dailymail.co.uk</p>
        <p>scout.com
stackoverflow.com</p>
        <p>archive.org
patentsencyclopedia.com
journals.plos.org
theguardian.com</p>
        <p>frontiersin.org
seekingalpha.com</p>
        <p>law.justia.com
ncbi.nlm.nih.gov
washingtonpost.com
slideplayer.com
L link.springer.com
RU nature.com
fanfiction.net
barnesandnoble.com
reddit.com
hindawi.com
patents.justia.com
amazon.co.uk
medium.com
finance.yahoo.com</p>
        <p>hubpages.com
s3-us-west-1.amazonaws.com
shameface.com.wstub.archive.org
casetext.com
thefreelibrary.com
bleacherreport.com
rightmove.co.uk
prweb.com
airbnb.co.uk
semanticscholar.org
articles.chicagotribune.com</p>
        <p>expedia.com
studymode.com
forums.macrumors.com</p>
        <p>telegraph.co.uk
publications.parliament.uk
0.000</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>F. Few-shot Machine Translation</title>
      <p>Here, we provide more details on our experimental setup
for the Machine Translation task. In our experiments, we
test the capability of a base model (i.e., with no
instruction fine-tuning or task-specific fine-tuning) to translate
a sentence from English to Italian and vice versa.
Previously, LLMs have been shown to perform well in machine
translation and they now rival task-specific MT systems
on a number of benchmarks [50] and tasks [51]. In our
case, we prompt the language models by providing a set
of 5 randomly sampled English-to-Italian translations
(and vice-versa for the Italian-to-English translation).
Finally, we measure the translation performance of the
models using COMET, a learned metric that assesses the quality of an automatic translation against a gold reference, as COMET has shown better correlation with human judgement than other metrics, such as BLEU.</p>
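      <p>A sketch of COMET scoring with the Unbabel comet package; the checkpoint name is a common public choice, not necessarily the exact model used in our evaluation:</p>
      <preformat>
# Sketch: scoring translations with COMET; the checkpoint below is a common
# public one and not necessarily the exact model used in our evaluation.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

data = [{
    "src": "How much does this book cost?",  # source sentence
    "mt": "Quanto costa questo libro?",      # system translation
    "ref": "Quanto costa questo libro?",     # gold reference
}]
scores = comet_model.predict(data, batch_size=8, gpus=0)
print(scores.system_score)  # corpus-level COMET score
      </preformat>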
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>tence?</source>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>07830</volume>
          (
          <year>2019</year>
          ). decoder,
          <source>arXiv preprint arXiv:2304.04052</source>
          (
          <year>2023</year>
          ). [35]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          , [46]
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cherry</surname>
          </string-name>
          , G. Foster, M. Krikun,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Representations (ICLR</surname>
          </string-name>
          ) (
          <year>2021</year>
          ). PMLR,
          <year>2023</year>
          , pp.
          <fpage>10867</fpage>
          -
          <lpage>10878</lpage>
          . [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , et al.,
          <source>Piqa</source>
          <volume>:</volume>
          [47]
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Wen-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>language, in: Proceedings of the AAAI confer- A. Fan, The flores-101 evaluation benchmark for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>ence on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp. low
          <article-title>-resource and multilingual machine translation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          7432-
          <fpage>7439</fpage>
          . Transactions of the Association for Computational [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <source>Crowdsourcing Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>522</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>multiple choice science questions</article-title>
          , arXiv preprint [48]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Titov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          , Improv-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>arXiv:1707.06209</source>
          (
          <year>2017</year>
          ).
          <source>ing massively multilingual neural machine trans</source>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>Truthfulqa: Measur- lation and zero-shot translation</article-title>
          , arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>ing how models mimic human falsehoods</article-title>
          , arXiv arXiv:
          <year>2004</year>
          .
          <volume>11867</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>preprint arXiv:2109.07958</source>
          (
          <year>2021</year>
          ). [49]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ross</surname>
          </string-name>
          , Biases in large lan[39]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          , Y. Choi, guage models: Origins, inventory, and discussion,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Winogrande</surname>
          </string-name>
          :
          <article-title>An adversarial winograd schema chal-</article-title>
          <source>J. Data and Information Quality</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          lenge at scale,
          <source>Communications of the ACM</source>
          <volume>64</volume>
          URL: https://doi.org/10.1145/3597307. doi:
          <volume>10</volume>
          .1145/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          (
          <year>2021</year>
          )
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
          .
          <fpage>3597307</fpage>
          . [40]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Serino</surname>
          </string-name>
          , [50]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Minhas</surname>
          </string-name>
          , I. Ilyas,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2024. URL: https://arxiv.org/abs/2406.17535. in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.), Pro-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>arXiv:2406.17535. ceedings of the 2023 Conference on Empirical</source>
          [41]
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mihaylov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>Methods in Natural Language Processing</source>
          , Asso-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>ale</surname>
            , J. Du,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Pasunuru</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Shleifer</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          <string-name>
            <surname>Koura</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>1612</fpage>
          -
          <lpage>1634</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. O</given-names>
            <surname>'Horo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , L. Zettlemoyer, org/
          <year>2023</year>
          .emnlp-main.
          <volume>100</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Few-</surname>
          </string-name>
          emnlp-main.
          <volume>100</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>shot learning with multilingual generative lan-</article-title>
          [51]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. F.</given-names>
            <surname>Minhas</surname>
          </string-name>
          , S. Potdar,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022</source>
          Con
          <article-title>- with retrieval-augmented generation from multi-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>guage Processing</surname>
          </string-name>
          ,
          <source>Association for Computational 2024 Conference on Empirical Methods in Natu-</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>9019</fpage>
          -
          <lpage>9052</lpage>
          . URL: https://aclanthology. tional Linguistics, Miami, Florida, USA,
          <year>2024</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          org/
          <year>2022</year>
          .emnlp-main.
          <volume>616</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          . https://arxiv.org/abs/2410.14057.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          emnlp-main.
          <volume>616</volume>
          . [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Sekhavat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>guage Model Family with Open Training and In- Table 6 shows the source of each dataset used to train</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>ference Framework</source>
          , arXiv.org (
          <year>2024</year>
          ).
          <article-title>URL: https: Minerva in its diferent sizes</article-title>
          .
          <source>The Tokens column shows</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          //arxiv.org/abs/2404.14619v1.
          <article-title>the total number of tokens we used from each dataset</article-title>
          . [43]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Nissim, It5: Large-scale text-to-text Where Table 1 shows more tokens used for training, it</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>generation</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:2203.03759</source>
          (
          <year>2022</year>
          ).
          <article-title>reach that number</article-title>
          .
          <article-title>All these datasets are openly licensed</article-title>
          . [44]
          <string-name>
            <given-names>N.</given-names>
            <surname>Landro</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. La</given-names>
            <surname>Grassa</surname>
          </string-name>
          , E. Federici, Two
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>summarization, Information</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>228</fpage>
          . [45]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. M.-C. So</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Z</given-names>
          </string-name>
          . Liu,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>