Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data

Riccardo Orlando1,†, Luca Moroni1,†, Pere-Lluís Huguet Cabot1,†, Edoardo Barba1, Simone Conia1, Sergio Orlandini2, Giuseppe Fiameni3 and Roberto Navigli1,∗

1 Sapienza NLP Group, Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome, Italy
2 CINECA, Bologna, Italy
3 NVIDIA, Santa Clara, California, USA

Abstract
The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model's vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva's development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.

Keywords
Large Language Models, Language Modeling, Italian Language, LLM Pretraining

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
orlando@diag.uniroma1.it (R. Orlando); moroni@diag.uniroma1.it (L. Moroni); huguetcabot@diag.uniroma1.it (P. H. Cabot); barba@diag.uniroma1.it (E. Barba); conia@diag.uniroma1.it (S. Conia); s.orlandini@cineca.it (S. Orlandini); gfiameni@nvidia.com (G. Fiameni); navigli@diag.uniroma1.it (R. Navigli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Large Language Models (LLMs) have revolutionized the way Natural Language Processing (NLP) tasks are approached, achieving remarkable results in existing areas and opening the door to entirely new research directions and applications. As a result, the energy and resources dedicated to the study and creation of LLMs are growing exponentially. However, most LLMs – both closed- and open-source – are predominantly designed for English, posing significant challenges and limitations for their use in non-English settings. In practice, generating Italian text using multilingual or language-adapted English models, e.g., from Mistral [1] or Llama [2, 3], is computationally more expensive and often less effective compared to using a model specifically designed for the Italian language. This inefficiency stems from the vocabulary of an English or multilingual LLM – i.e., the lexical units, or tokens, that the model can use to compose text – when it is not optimized for the Italian language, resulting in Italian words being split into an excessive number of tokens. Consequently, this creates longer sequences of tokens, slower generation times, and higher computational costs, especially since many popular attention mechanisms have a quadratic complexity with respect to sequence length.
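The snippet below is a minimal sketch (not part of the paper) of the over-segmentation problem described above: it counts how many subword tokens an English-centric tokenizer produces per word on an Italian and an English sentence. The model id is illustrative and may require accepting the provider's terms on the Hugging Face Hub.

```python
# Sketch: Italian text is split into many more pieces by an English-centric tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

sentence_it = "Il gatto si addormentò sul davanzale della finestra."
sentence_en = "The cat fell asleep on the window sill."

for text in (sentence_it, sentence_en):
    pieces = tok.tokenize(text)
    words = text.split()
    # tokens-per-word is exactly the "fertility" measure discussed in Section 3.1
    print(f"{len(pieces) / len(words):.2f} tokens per word -> {pieces}")
```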
Efforts to create language-specific LLMs are increasing, and fall primarily into two main categories: i) adapting existing English-centric LLMs to other languages, and ii) training LLMs from scratch. The advantages of adapting existing English-centric LLMs to other languages are enticing: starting with a proven model can reduce the computational requirements, and adaptation can be achieved with relatively modest amounts of data. There are several language adaptation techniques, which range from fine-tuning the model on data for the target language [4, 5] to modifying the model's architecture [6, 7, 8], making these techniques flexible for different budgets and objectives. However, these techniques may not fully capture language-specific nuances and can degrade performance in the original language, an undesirable effect. Alternatively, training LLMs from scratch provides the freedom to make design choices tailored to the linguistic features of the target language – including morphology, lexicon, syntax, and semantics – which are often overlooked in English-centric models [9]. It also allows for incorporating culturally relevant content, reducing biases that might be present in models primarily trained on English data, thus leading to more inclusive and accurate representations of language use.

Unfortunately, while there are several efforts on adapting English-centric LLMs to the Italian language, e.g., LLaMAntino-2 [4], LLaMAntino-3 [5], DanteLLM [10], and Camoscio [11], inter alia, there is no truly open-source endeavor exploring what can be achieved by training an LLM from scratch on Italian data.

With this work, we follow the latter path and introduce Minerva, the first family of LLMs designed specifically for the Italian language and pretrained on Italian text.1 We present the design choices for our models, our data processing, and the evaluation results for our Minerva LLMs, showing that our models – with 350M, 1B, 3B, and 7B parameters – outperform comparable multilingual models and even rival larger models adapted for Italian.

1 https://nlp.uniroma1.it/minerva

Table 1. Datasets used to train Minerva with their languages (second column) and number of tokens per model size (third to sixth columns).

Dataset Name | Lang. | 350M | 1B | 3B | 7B
RedPajama-V2 | Italian | – | – | – | 894B
CulturaX | Italian | 35B | 100B | 330B | 237B
Wikipedia | Italian | – | – | – | 1.3B
Gutenberg | Italian | – | – | – | 0.15B
Wikisource | Italian | – | – | – | 0.12B
EurLex | Italian | – | – | – | 1.6B
Gazzetta Ufficiale | Italian | – | – | – | 1.7B
FineWeb | English | – | – | – | 1,076B
CulturaX | English | 35B | 100B | 330B | –
Wikipedia | English | – | – | – | 5.3B
ArXiv | English | – | – | – | 33B
Gutenberg | English | – | – | – | 7B
StackExchange | English | – | – | – | 22B
The Stack V2 | Code | – | – | – | 201B
Total # of tokens | | 70B | 200B | 660B | 2.48T
We conclude with a discussion of the benefits and challenges of pretraining LLMs from scratch for the Italian language, sharing our experience and findings to provide valuable insights for the academic and industrial communities interested in training non-English LLMs from scratch. Lastly, we describe the technical details of Minerva-7B, our latest model with 7.4 billion parameters, for which we share our initial results.

2. Building a Pretraining Dataset for Italian LLMs

The field of LLMs is growing at an astonishing pace, with new models, datasets, benchmarks, and techniques presented every week. However, over the past few months, academic and industrial researchers have increasingly recognized the fundamental role of the data used to pretrain LLMs. Unsurprisingly, the majority of the leading companies are not releasing their training data as they seek to maintain an advantage over the competition, with very few exceptions (e.g., OLMo by AllenAI [12] and OpenELM by Apple [13]). In this section, we describe the different sources of data used in the training of the Minerva models, and Table 1 provides an overview of these (cf. Appendix A for more details). Most importantly, the training datasets we used are entirely available online, making our process transparent and allowing researchers to better study the connection between pretraining data and model behavior.

2.1. Data Sources

The training data for our Minerva models consists of three main categories: Italian, English, and code data. We only use the code data to train our largest model, i.e., Minerva-7B.

2.1.1. Italian Data

Web data. The majority of the text used to train LLMs is sourced from Web-scraped data, typically from CommonCrawl (CC). Therefore, a significant portion of the Italian text included in our training datasets is also of this nature, inherently exposing our models to potential biases and toxic content commonly found on the Web. Because preprocessing techniques, such as language identification, perplexity filtering, deduplication, and content classification, are computationally expensive, the most sensible choice is to rely on preprocessed collections, such as CulturaX [14] and RedPajama v2 [15]. These collections already include Italian data, and have undergone various levels of filtering and deduplication, as discussed in Section 2.2.

Curated data. While Penedo et al. [16] suggest that high-quality Web data is sufficient on its own to train LLMs, curated data sources are often used to further improve model performance and introduce a broader diversity of data types, such as encyclopedic and academic text [17], as well as scientific and math-related text. Therefore, we include curated texts from several sources, including Wikipedia (encyclopedic/world knowledge data), EurLex and Gazzetta Ufficiale (law, economics, and politics), and the Gutenberg Project (novels, poetry, etc.).
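As a concrete illustration of relying on preprocessed collections, the sketch below streams the Italian partition of CulturaX through the Hugging Face `datasets` library without downloading it in full. The dataset id, configuration name, and field names are assumptions based on the public hub card, and access to CulturaX may require accepting its terms of use.

```python
# Sketch: streaming a preprocessed Italian web collection instead of reprocessing CommonCrawl.
from datasets import load_dataset

# Assumption: CulturaX exposes one configuration per language code (here "it").
culturax_it = load_dataset("uonlp/CulturaX", "it", split="train", streaming=True)

for i, doc in enumerate(culturax_it):
    # Each record carries the cleaned text plus provenance metadata.
    print(doc["text"][:80].replace("\n", " "), "| source:", doc.get("source"))
    if i == 2:
        break
```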
2.1.2. English Data

Web data. Mirroring our approach with the Italian data, we use preprocessed collections of English data from the Web. Given that English is the most popular language on the Internet and has been the primary focus of LLM research, there are numerous options that already provide a large amount of tokens from filtered, deduplicated, and cleaned sources. For our Minerva-350M, 1B, and 3B models, we collect data from the English partition of CulturaX, capping the number of tokens to the same amount as the Italian ones, as shown in Table 1. Instead, to train Minerva-7B, we use a portion of FineWeb [18], which includes filtered and deduplicated CC dumps with various timestamps. Specifically, we use the CC dumps from 2023-14 to 2024-18 to match the total number of tokens in the Italian Web partition of our training data.

Curated sources. We include the 5.3B tokens from the English Wikipedia and 7B tokens from the copyright-free books in Project Gutenberg. Additionally, we include data from arXiv and StackExchange, which are included in the RedPajama dataset.

2.1.3. Code Data

Previous work has highlighted the importance of including source code in the pretraining corpus of an LLM, in order to improve not only its code understanding and generation, but also its general reasoning capabilities [19], even for tasks that do not directly involve or require programming. Therefore, for our largest model – Minerva-7B – we also include a portion of code data. More specifically, we extract 200B tokens from The Stack V2 [20], selecting the data from their deduplicated partition, which includes 17 of the most popular programming languages on GitHub.

2.2. Data Preprocessing

As mentioned above, our preprocessing effort remains minimal, as we rely on the preprocessing pipelines used in CulturaX, RedPajama, and FineWeb. To evaluate the content and quality of our training data, we employ the methodology described in Elazar et al. [21] to analyze the URL domain distribution within the Italian partition of CulturaX and RedPajama, as these partitions had never been utilized in training an LLM prior to Minerva. We provide an overview of our analysis together with a few insights in Appendix B.

2.3. Data Filtering and Deduplication

Previous work on English-centric LLMs [22] has already emphasized the importance of training LLMs on "clean" data. Two of the most important parts of data cleaning are filtering, i.e., removing content that does not satisfy a set of criteria, and deduplication, i.e., removing portions of text that appear too often so as to minimize memorization.

As mentioned above, for the corpus used to train the Minerva models, we rely mainly on collections of data that have already been filtered and deduplicated. However, there are some minor considerations that depend on each collection of data. More specifically, we use CulturaX as-is, relying on its filtering and deduplication pipeline. Unfortunately, RedPajama v2 is not filtered and deduplicated; however, its data is tagged with meta-information that can be used to apply filtering and deduplication. Such metadata includes, for example, the perplexity score of each text computed via a language model trained on Wikipedia, which is used to partition RedPajama v2 into three partitions: head, middle, and tail. For our training corpus, we only include a document if it is classified as head or middle according to its perplexity score. Moreover, we use the precomputed metadata to remove exact duplicates and apply fuzzy deduplication. The latter is performed by using the hash provided for each document with Locality Sensitive Hashing and a Jaccard similarity threshold of 0.7 to decide whether two documents are fuzzy duplicates. Note that we only apply fuzzy deduplication within each CC dump, rather than across all the dumps. This decision is motivated by two observations: first, applying fuzzy deduplication across all CC dumps is computationally expensive; second, previous work [18] has shown that per-dump deduplication is not only sufficient, but is also beneficial, when training English LLMs.
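The following is a minimal sketch of fuzzy deduplication with MinHash and Locality Sensitive Hashing at a Jaccard threshold of 0.7, applied independently to each CC dump as described above. It uses the `datasketch` library and character shingles for illustration; it is not the exact pipeline used for Minerva, which relies on the precomputed RedPajama v2 hashes.

```python
# Sketch: per-dump fuzzy deduplication with MinHash + LSH (Jaccard threshold 0.7).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    # 5-character shingles approximate the document's content for Jaccard similarity
    for shingle in {text[i:i + 5] for i in range(max(0, len(text) - 4))}:
        m.update(shingle.encode("utf-8"))
    return m

def dedup_dump(docs: dict[str, str], threshold: float = 0.7) -> list[str]:
    """Return the ids of the documents kept after fuzzy deduplication of one dump."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        m = minhash(text)
        if not lsh.query(m):        # no near-duplicate has been kept yet
            lsh.insert(doc_id, m)
            kept.append(doc_id)
    return kept

# Deduplication is run within each dump, never across dumps.
kept_ids = dedup_dump({
    "a": "il gatto dorme sul divano tutto il giorno",
    "b": "il gatto dorme sul divano tutto il giorno.",
    "c": "un testo completamente diverso dagli altri",
})
print(kept_ids)
```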
3. Minerva LLMs

In this section, we provide an overview of the Minerva LLMs: we describe their tokenizers, the design choices behind the model architecture, and how we trained the resulting LLMs.

3.1. Vocabulary and Tokenizers

The vocabulary of an LLM is mainly impacted by its size, i.e., the number of tokens in the vocabulary itself, and how the tokenizer is trained, i.e., which tokens make up the vocabulary. These two factors impact the fertility of the resulting tokenizer, which measures the average number of tokens (subwords) into which a word is split. Tokenizers with lower fertility are preferable, as the input and output sequences they produce are shorter, resulting in an efficiency gain, especially as most attention mechanisms are quadratic with respect to the sequence length. Unsurprisingly, the vocabulary allocation of an English-centric LLM minimizes the fertility of English text, and results in high fertility values for Italian text, as shown in Table 2.

Table 2. Fertility rates (lower is better) for Minerva tokenizers compared to other LLMs. The fertility rates are computed on a randomly sampled collection of texts from CulturaX and Wikipedia in both Italian (Ita) and English (Eng).

Tokenizer | |Vocab| | CulturaX Ita | CulturaX Eng | Wikipedia Ita | Wikipedia Eng
Mistral-7B | 32,000 | 1.87 | 1.32 | 2.05 | 1.57
Gemma-7B | 256,000 | 1.42 | 1.18 | 1.56 | 1.34
Minerva-350M | 32,768 | 1.39 | 1.32 | 1.66 | 1.59
Minerva-1B | 32,768 | 1.39 | 1.32 | 1.66 | 1.59
Minerva-3B | 32,768 | 1.39 | 1.32 | 1.66 | 1.59
Minerva-7B | 51,200 | 1.32 | 1.26 | 1.56 | 1.51

Given the importance for our Minerva LLMs of having a low fertility on Italian text, we intentionally train the Minerva tokenizer on a balanced mix of English and Italian data (plus code data for the 7B model). Our analysis shows that this strategy leads to a much improved fertility on Italian data, while at the same time maintaining similar fertility on English data. More specifically, for Minerva-350M/1B/3B, we opted for a vocabulary size similar to that of Mistral-7B (around 32k tokens): in this case, the fertility of the Minerva tokenizer is ~20% better than the Mistral tokenizer on the Italian Wikipedia and only ~1% worse on the English Wikipedia. Following recent trends in LLMs, for Minerva-7B, we increased the vocabulary size to around 50k tokens, which resulted in a further fertility improvement of ~6% and ~5% on the Italian and English Wikipedias, respectively, notwithstanding the addition of code data to the training data. We provide more details on the tokenizer in Appendix C.
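A minimal sketch of how fertility (average number of tokens per word) can be estimated for a tokenizer on a sample of text is shown below; the model ids and the one-sentence sample are illustrative, and a real measurement would use a large random sample from CulturaX or Wikipedia as in Table 2.

```python
# Sketch: estimating tokenizer fertility (tokens per whitespace word) on sample text.
from transformers import AutoTokenizer

def fertility(tokenizer, documents: list[str]) -> float:
    n_tokens, n_words = 0, 0
    for doc in documents:
        n_tokens += len(tokenizer.tokenize(doc))
        n_words += len(doc.split())
    return n_tokens / n_words

sample_it = ["Una frase italiana di esempio per stimare la fertilità del tokenizer."]
for name in ("mistralai/Mistral-7B-v0.1", "sapienzanlp/Minerva-3B-base-v1.0"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, f"{fertility(tok, sample_it):.2f}")
```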
3.2. Model Architecture

While the field of LLMs is moving rapidly, one of the best models when our efforts started was Mistral. Therefore, our Minerva LLMs are based on Mistral's model architecture. The Minerva LLMs are, therefore, a family of decoder-only transformer models, with a few standout features, such as grouped-query attention (GQA) [23], which boosts inference speed and reduces memory requirements for increased throughput, and sliding window attention (SWA) [24, 25], which manages longer sequences more efficiently at reduced computational costs. Specifically, the GQA is configured to share one key-value pair every four queries, while the SWA configuration handles windows of up to 2,048 tokens with a maximum context length of 16,384 tokens. We build four models with different sizes by scaling the number of attention heads, hidden size, intermediate size, and hidden layers, while maintaining a ratio of ~3.5 between the hidden size and intermediate size, as in the original Mistral model. However, following the more recent model releases by Mistral, Minerva-7B does not use SWA. Instead, it implements full attention across its entire context length, which can extend up to 4,096 tokens, i.e., double the sliding window length used in Minerva-350M/1B/3B. The parameters for each model size are detailed in Table 3, for which we provide a more in-depth description in Appendix D.

Table 3. Overview of the main hyperparameters for our Minerva models. We include the number of parameters (approximately 350M, 1B, 3B, and 7B) and the corresponding number of layers, hidden size, intermediate size, attention heads, key-value heads, sliding window length, and maximum context length.

Model | Params | Layers | Hidden Size | Inter. Size | Att. Heads | KV Heads | SW Length | Ctx. Length
Minerva-350M | 352M | 16 | 1152 | 4032 | 16 | 4 | 2048 | 16,384
Minerva-1B | 1.01B | 16 | 2048 | 7168 | 16 | 4 | 2048 | 16,384
Minerva-3B | 2.89B | 32 | 2560 | 8960 | 32 | 8 | 2048 | 16,384
Minerva-7B | 7.40B | 32 | 4096 | 14336 | 32 | 8 | None | 4,096

Building Minerva on top of Mistral's model architecture also brings other benefits, such as broad compatibility with the ecosystem of libraries, frameworks, and tools that has emerged over recent months, including llama.cpp [26], FlashAttention [27], and vLLM [28].

3.3. Model Training

We train all the Minerva LLMs using MosaicML's LLM Foundry.2 The training process is conducted on the Leonardo supercomputer3 hosted and maintained by CINECA. Each node in Leonardo is equipped with four custom NVIDIA A100 SXM4 GPUs with 64GB of VRAM each.

All our models are trained using the AdamW optimizer [29] with β1 = 0.9, β2 = 0.95, ε = 10^-8 (with the only exception being Minerva-7B, which is trained using ε = 10^-5) on a standard causal language modeling training objective. To smooth the training process, we follow standard practice in the literature and employ a warmup-then-cooldown learning rate schedule. More specifically, we first increase the learning rate linearly during the initial training phase (2% of the total number of training steps for Minerva-350M/1B/3B and ~0.3% for Minerva-7B) until the peak learning rate is reached (2 × 10^-4 for Minerva-350M/1B/3B, 3 × 10^-4 for Minerva-7B), and then decrease the learning rate with a cosine schedule until the end of the training process. The hyperparameters used for each model are shown in Table 7.

2 https://github.com/mosaicml/llm-foundry
3 https://leonardo-supercomputer.cineca.eu/
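The warmup-then-cosine schedule described above can be written as a small function of the training step, as in the sketch below. This is an illustration of the schedule's shape, not the exact implementation used in LLM Foundry.

```python
# Sketch: linear warmup followed by cosine decay of the learning rate.
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-4,
               warmup_frac: float = 0.02, final_lr: float = 0.0) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear warmup from 0 to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # cosine decay from the peak learning rate down to final_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# e.g., Minerva-3B: peak 2e-4 with a 2% warmup over 157,357 steps (cf. Table 7)
print(lr_at_step(1_000, 157_357), lr_at_step(157_357, 157_357))
```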
4. Evaluation

We measure the 0-shot performance of our Minerva LLMs on ITA-Bench [30], a suite of benchmarks that have been created either by translating existing benchmarks from other languages, or by adapting existing Italian benchmarks so that they can be used for LLM evaluation. ITA-Bench includes a set of 10 benchmarks commonly used to evaluate LLMs, namely, ARC Challenge (ARC-C), ARC Easy (ARC-E) [31], BoolQ [32], GSM8K [33], HellaSwag (HS) [34], MMLU [35], PIQA [36], SciQ [37], TruthfulQA [38], and Winogrande (WG) [39]. Overall, these benchmarks offer a comprehensive view of the capabilities of an LLM on a wide variety of aspects, including scientific knowledge, world knowledge (e.g., geography, politics, economics), commonsense knowledge, physical interactions, coreference, and math reasoning, among others. Employing automatically-translated benchmarks is far from ideal, but it allows us to better compare the scores obtained in Italian with those obtained in English, while the Italian research community develops Italian-specific benchmarks [40].

Table 4. Zero-shot evaluation results of the Minerva models on a set of standard benchmarks translated from English to Italian.

Size | Name | ARC-C | ARC-E | BoolQ | GSM8K | HS | MMLU | PIQA | SciQ | TQA | WG | AVG
0.4B | Minerva-350M-base-v1.0 | 24.6 | 36.4 | 60.7 | 48.2 | 32.6 | 25.7 | 59.5 | 63.7 | 46.5 | 58.4 | 45.6
1B | Minerva-1B-base-v1.0 | 26.6 | 42.2 | 57.1 | 49.7 | 39.6 | 27.0 | 62.9 | 73.5 | 44.6 | 60.0 | 48.3
3B | OpenELM-3B | 27.0 | 37.9 | 60.9 | 49.7 | 40.7 | 28.3 | 56.7 | 81.8 | 47.3 | 58.4 | 48.9
3B | XGLM-2.9B | 27.5 | 41.4 | 59.1 | 65.7 | 44.5 | 27.4 | 59.9 | 77.8 | 43.1 | 60.2 | 50.6
3B | Minerva-3B-base-v1.0 | 31.4 | 49.1 | 62.1 | 55.8 | 52.9 | 29.2 | 66.9 | 79.9 | 41.4 | 62.2 | 53.1
7B | OLMo-7B-0724-hf | 30.7 | 44.0 | 72.9 | 52.5 | 47.9 | 30.9 | 58.7 | 85.1 | 44.6 | 61.2 | 52.8
7B | LLaMAntino-2-7b | 33.7 | 50.8 | 70.9 | 52.2 | 54.9 | 33.8 | 64.4 | 86.1 | 44.3 | 64.1 | 55.5
7B | Minerva-7B-base-v1.0 | 42.0 | 68.8 | 79.5 | 50.0 | 62.6 | 36.2 | 69.8 | 87.7 | 38.5 | 65.0 | 60.0
7B | Mistral-7B-v0.1 | 42.8 | 61.3 | 78.2 | 56.1 | 60.4 | 38.0 | 65.5 | 90.8 | 43.5 | 68.8 | 60.5
8B | Llama-3.1-8B | 44.0 | 61.1 | 78.0 | 57.8 | 62.9 | 38.7 | 67.7 | 90.3 | 43.0 | 69.2 | 61.3

As shown in Table 4, the average performance of the Minerva models increases steadily with model size. For our 3B model, we also provide a comparison with two models of the same size: XGLM [41], a multilingual LLM by Meta, and OpenELM [42], a very recent English-only model developed by Apple. Our evaluation shows that Minerva-3B outperforms XGLM and OpenELM by a significant margin, i.e., +4.4% and +3.7% on average.

Finally, Minerva-7B achieves the highest performance among the Minerva LLM family, as expected. Notably, Minerva-7B achieves a higher average score than LLaMAntino-2. This is an interesting comparison because the pretraining data for Llama-2, i.e., the pretrained LLM used to build LLaMAntino-2, is not available and has never been disclosed, making the model open-weights but not entirely open-source.4 When compared to closed-source LLMs such as Mistral-7B-v0.1 or Llama-3.1-8B, Minerva still lags behind in some tasks, such as BoolQ or GSM8K, which may require better reasoning capabilities and/or more pretraining data. As we can observe from Figure 1, which tracks the progress of Minerva-7B on ITA-Bench every 10,000 training steps, the model is still slowly improving towards the end of the pretraining phase, suggesting that a larger training corpus or multiple epochs may be beneficial in future developments.

4 We stress that, for LLaMAntino-2, only the data that has been used for the language adaptation process is available, whereas the pretraining data is not.

[Figure 1: Tracking the progress of Minerva-7B-base-v1.0 during its pretraining process (average accuracy on ITA-Bench over time). We report the average accuracy on ITA-Bench every 10,000 steps, i.e., approximately every 40B tokens.]
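For readers unfamiliar with how 0-shot scores such as those in Table 4 are typically obtained on multiple-choice benchmarks, the sketch below shows one common approach: each candidate answer is scored by the sum of its token log-probabilities under the model, and the highest-scoring option is selected. This is a generic illustration, not the ITA-Bench harness itself; the model id is illustrative and the prompt/answer concatenation is a simplification.

```python
# Sketch: log-likelihood scoring of answer options for 0-shot multiple-choice evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapienzanlp/Minerva-3B-base-v1.0"  # illustrative id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    # Approximation: retokenizing prompt+option may slightly shift the boundary tokens.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..N-1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()          # sum only over answer tokens

def predict(prompt: str, options: list[str]) -> int:
    return max(range(len(options)), key=lambda i: option_logprob(prompt, options[i]))

print(predict("Domanda: Qual è la capitale d'Italia? Risposta:", ["Roma", "Milano", "Napoli"]))
```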
5. Downstream tasks

In this section, we show the results of the Minerva models when adapted to two downstream applications. This analysis is particularly relevant for Minerva-350M and Minerva-1B, which can be utilized for specific tasks rather than as general-purpose models, offering lower computational costs. The tasks in this analysis include: i) Italian Abstractive News Summarization, and ii) Machine Translation, in both directions (IT-EN and EN-IT).

News Summarization. Following Sarti and Nissim [43], we fine-tune Minerva models (up to 3B) on a concatenation of two Italian news summarization datasets, built from the Fanpage.it and Il Post newspapers [44]. A detailed overview of the hyperparameters used to train our models is provided in Appendix E. We find that Minerva-3B obtains the best results (0.30 vs 0.29 of the second best in terms of ROUGE-L); however, it is not as parameter-efficient as IT5-Large, probably because encoder-decoder models are more suitable for fine-tuning than decoder-only models [45]. In Table 8, we report the full results of Minerva fine-tuned on the aforementioned datasets and compared to the baselines in Sarti and Nissim [43], which include mBART, mT5, and IT5.

Machine Translation. We also evaluate our Minerva LLMs in few-shot [46] machine translation on two benchmarks, FLORES [47] and OPUS-100 [48]. We explore how LLMs perform this task relying only on in-context-learning few-shot examples, reporting our results with 5-shot prompting. We rely on the vLLM library [28] and change the default parameters to temperature=0 and max_tokens=512. We highlight that Minerva-3B reaches competitive results in MT in both EN-IT (84.8 on FLORES and 76.7 on OPUS in terms of COMET score) and IT-EN (85.7 and 78.0). Compared with other models of similar size, Minerva-3B shows strong results when the target language is Italian (+1.7 and +2.7 compared to Gemma-2B and Qwen-1.5B on OPUS). Minerva-7B further showcases this by achieving the highest performance among the models tested when translating from English into Italian. The full results are reported in Table 5.

Table 5. COMET scores measuring the translation capabilities of our Minerva models and other LLMs on the FLORES and OPUS datasets. The evaluation is conducted in a 5-shot setting, where each model receives five random translation examples from the development set before the test instance.

Model | FLORES EN-IT ↑ | FLORES IT-EN ↑ | OPUS EN-IT ↑ | OPUS IT-EN ↑
Minerva-1B | 66.37 | 73.72 | 57.40 | 64.61
Minerva-3B | 84.83 | 85.67 | 76.74 | 78.04
Minerva-7B | 87.02 | 87.20 | 79.07 | 79.91
Gemma-2B | 83.31 | 86.51 | 75.05 | 78.94
Qwen-1.5B | 80.18 | 86.16 | 74.01 | 78.95
TinyLlama-1.1B-v1.1 | 73.40 | 83.62 | 65.72 | 75.44
LLaMA-2-7B | 85.24 | 87.47 | 77.30 | 80.36
Mistral-7B | 86.56 | 87.75 | 78.08 | 80.56
Qwen-7B | 86.00 | 87.66 | 78.50 | 81.21
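The sketch below illustrates the 5-shot greedy translation setup described above using vLLM (temperature=0, max_tokens=512). The model id, the prompt template, and the in-context example pairs are illustrative assumptions; the actual experiments sample five examples from the benchmark development sets, as detailed in Appendix F.

```python
# Sketch: 5-shot EN->IT translation prompting with vLLM and greedy decoding.
from vllm import LLM, SamplingParams

llm = LLM(model="sapienzanlp/Minerva-3B-base-v1.0")            # illustrative id
params = SamplingParams(temperature=0, max_tokens=512, stop=["\n"])

few_shot = [  # in practice, five pairs sampled from the development set
    ("The weather is nice today.", "Oggi il tempo è bello."),
    ("Where is the train station?", "Dov'è la stazione dei treni?"),
]

def build_prompt(source: str) -> str:
    lines = [f"English: {en}\nItalian: {it}" for en, it in few_shot]
    lines.append(f"English: {source}\nItalian:")
    return "\n\n".join(lines)

outputs = llm.generate([build_prompt("The book is on the table.")], params)
print(outputs[0].outputs[0].text.strip())
```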
6. Conclusion and Future Work

In this paper, we demonstrated the feasibility and benefits of pretraining Italian language models from scratch, which not only improves the computational efficiency and performance of an LLM for a target language but also reduces linguistic biases inherited from English training corpora [49]. The Minerva models (https://nlp.uniroma1.it/minerva) showcase promising results on a variety of Italian benchmarks and downstream tasks, including news summarization and machine translation. Most importantly, we describe, for the first time, the process of creating an Italian pretraining corpus with more than 1T tokens, and we share findings and insights into the pretraining process of Italian LLMs with the academic and industrial communities, paving the way for future research in training non-English language models. We hope that our contributions will represent a stepping stone for future work on language-specific and multilingual large-scale language modeling.

Acknowledgments

Edoardo Barba, Simone Conia and Pere-Lluís Huguet Cabot are fully funded by the PNRR MUR project PE0000013-FAIR. Roberto Navigli acknowledges the support of the CREATIVE PRIN project. The authors acknowledge the CINECA award IsB28_medit under the ISCRA initiative for the availability of high-performance computing resources and support.

References

[1] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[3] H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[4] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.
[5] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.
[6] M. Ostendorff, G. Rehm, Efficient language model training through cross-lingual and progressive transfer learning, arXiv preprint arXiv:2301.09626 (2023).
[7] K. Dobler, G. de Melo, FOCUS: Effective embedding initialization for monolingual specialization of multilingual models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13440–13454.
[8] Z. Csaki, B. Li, J. Li, Q. Xu, P. Pawakapan, L. Zhang, Y. Du, H. Zhao, C. Hu, U. Thakker, SambaLingo: Teaching large language models new languages, arXiv preprint arXiv:2404.05829 (2024).
[9] M. Faysse, P. Fernandes, N. Guerreiro, A. Loison, D. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, P. Martins, et al., CroissantLLM: A truly bilingual French-English language model, arXiv preprint arXiv:2402.00786 (2024).
[10] A. Bacciu, C. Campagnano, G. Trappolini, F. Silvestri, DanteLLM: Let's push Italian LLM research forward!, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4343–4355.
[11] A. Santilli, E. Rodolà, Camoscio: An Italian instruction-tuned LLaMA, 2023. arXiv:2307.16456.
[12] D. Groeneveld, I. Beltagy, P. Walsh, et al., OLMo: Accelerating the science of language models, 2024. arXiv:2402.00838.
[13] S. Mehta, M. H. Sekhavat, Q. Cao, M. Horton, Y. Jin, C. Sun, I. Mirzadeh, M. Najibi, D. Belenko, P. Zatloukal, M. Rastegari, OpenELM: An efficient language model family with open training and inference framework, 2024. arXiv:2404.14619.
[14] T. Nguyen, C. V. Nguyen, V. D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, T. H. Nguyen, CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages, 2023. arXiv:2309.09400.
[15] Together Computer, RedPajama: An open dataset for training large language models, 2023. https://github.com/togethercomputer/RedPajama-Data.
[16] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, J. Launay, The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, arXiv preprint arXiv:2306.01116 (2023).
[17] M. Faysse, P. Fernandes, N. M. Guerreiro, A. Loison, D. M. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, P. H. Martins, et al., CroissantLLM: A truly bilingual French-English language model, 2024. arXiv:2402.00786.
[18] G. Penedo, H. Kydlíček, L. B. Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, T. Wolf, The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. arXiv:2406.17557.
[19] P. Liang, R. Bommasani, T. Lee, et al., Holistic evaluation of language models, Transactions on Machine Learning Research (2023).
[20] A. Lozhkov, R. Li, L. B. Allal, et al., StarCoder 2 and The Stack v2: The next generation, 2024. arXiv:2402.19173.
[21] Y. Elazar, A. Bhagia, I. H. Magnusson, et al., What's in my big data?, in: The Twelfth International Conference on Learning Representations, 2024.
[22] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, J. Launay, The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, 2023. arXiv:2306.01116.
[23] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, S. Sanghai, GQA: Training generalized multi-query transformer models from multi-head checkpoints, arXiv preprint arXiv:2305.13245 (2023).
[24] R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019).
[25] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[26] G. Gerganov, llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++. https://github.com/ggerganov/llama.cpp.
[27] T. Dao, FlashAttention-2: Faster attention with better parallelism and work partitioning, in: International Conference on Learning Representations (ICLR), 2024.
[28] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with PagedAttention, in: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[29] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[30] L. Moroni, S. Conia, F. Martelli, R. Navigli, ITA-Bench: Towards a more comprehensive evaluation for Italian LLMs, in: CLiC-it, 2024.
[31] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, arXiv preprint arXiv:1803.05457 (2018).
[32] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, BoolQ: Exploring the surprising difficulty of natural yes/no questions, arXiv preprint arXiv:1905.10044 (2019).
[33] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021).
[34] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, arXiv preprint arXiv:1905.07830 (2019).
[35] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, in: Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[36] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al., PIQA: Reasoning about physical commonsense in natural language, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 7432–7439.
[37] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science questions, arXiv preprint arXiv:1707.06209 (2017).
[38] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring how models mimic human falsehoods, arXiv preprint arXiv:2109.07958 (2021).
[39] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, WinoGrande: An adversarial Winograd Schema Challenge at scale, Communications of the ACM 64 (2021) 99–106.
[40] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino, A. Seveso, Disce aut deficere: Evaluating LLMs proficiency on the INVALSI Italian benchmark, 2024. arXiv:2406.17535.
[41] X. V. Lin, T. Mihaylov, M. Artetxe, et al., Few-shot learning with multilingual generative language models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 9019–9052.
[42] S. Mehta, M. H. Sekhavat, Q. Cao, M. Horton, Y. Jin, C. Sun, I. Mirzadeh, M. Najibi, D. Belenko, P. Zatloukal, M. Rastegari, OpenELM: An efficient language model family with open training and inference framework, 2024. arXiv:2404.14619.
[43] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022).
[44] N. Landro, I. Gallo, R. La Grassa, E. Federici, Two new datasets for Italian-language abstractive text summarization, Information 13 (2022) 228.
[45] Z. Fu, W. Lam, Q. Yu, A. M.-C. So, S. Hu, Z. Liu, N. Collier, Decoder-only or encoder-decoder? Interpreting language model as a regularized encoder-decoder, arXiv preprint arXiv:2304.04052 (2023).
[46] X. Garcia, Y. Bansal, C. Cherry, G. Foster, M. Krikun, M. Johnson, O. Firat, The unreasonable effectiveness of few-shot learning for machine translation, in: International Conference on Machine Learning, PMLR, 2023, pp. 10867–10878.
[47] N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, A. Fan, The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation, Transactions of the Association for Computational Linguistics 10 (2022) 522–538.
[48] B. Zhang, P. Williams, I. Titov, R. Sennrich, Improving massively multilingual neural machine translation and zero-shot translation, arXiv preprint arXiv:2004.11867 (2020).
[49] R. Navigli, S. Conia, B. Ross, Biases in large language models: Origins, inventory, and discussion, Journal of Data and Information Quality 15 (2023) 1–21.
[50] S. Conia, M. Li, D. Lee, U. Minhas, I. Ilyas, Y. Li, Increasing coverage and precision of textual information in multilingual knowledge graphs, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 1612–1634.
[51] S. Conia, D. Lee, M. Li, U. F. Minhas, S. Potdar, Y. Li, Towards cross-cultural machine translation with retrieval-augmented generation from multilingual knowledge graphs, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024.
A. Data sources

Table 6 shows the source of each dataset used to train Minerva in its different sizes. The Tokens column shows the total number of tokens we used from each dataset. Where Table 1 shows more tokens used for training, it means they were resampled from the total in order to reach that number. All these datasets are openly licensed.

Table 6. Detailed breakdown of each dataset.

Dataset | Tokens | Language | Genre | URL
RedPajama-Data-V2 | 688B | Italian | Web | https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
CulturaX | 158B | Italian | Web | https://huggingface.co/datasets/uonlp/CulturaX
Wikipedia | 1.3B | Italian | Encyclopedic | https://huggingface.co/datasets/wikimedia/wikipedia
Gutenberg | 0.15B | Italian | Books | https://huggingface.co/datasets/manu/project_gutenberg
Wikisource | 0.12B | Italian | Books | https://huggingface.co/datasets/wikimedia/wikisource
EurLex | 1.6B | Italian | Law | https://huggingface.co/datasets/joelito/eurlex_resources
Gazzetta Ufficiale | 1.7B | Italian | Law | https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale
FineWeb | 1,076B | English | Web | https://huggingface.co/datasets/HuggingFaceFW/fineweb
CulturaX | 330B | English | Web | https://huggingface.co/datasets/uonlp/CulturaX
Wikipedia | 5.3B | English | Encyclopedic | https://huggingface.co/datasets/wikimedia/wikipedia
ArXiv | 33B | English | Academic | https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
Gutenberg | 7B | English | Books | https://huggingface.co/datasets/manu/project_gutenberg
StackExchange | 22B | English | Forum | https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
The Stack V2 | 201B | Code | Code | https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids

B. Dataset Insights

We leveraged the WIMBD5 library to compute word counts per URL domain on CulturaX. We decided not to do this for RedPajama v2 or FineWeb as their original data already provides token counts and other insights into the dataset distribution. Figures 2 and 3 show the aggregation of word counts per domain for Italian and English, respectively.

5 https://github.com/allenai/wimbd

C. Tokenizer

We trained two tokenizers for Minerva. The first one is shared by the three smaller sizes, 350M, 1B and 3B. It is trained on a mix of 4GB of Italian text data and 4GB of English text data, both from CulturaX. Our objective is to have a balanced vocabulary across the two languages, mirroring the training data. We use the SentencePiece library6 to train a BPE tokenizer and we apply byte fallback. We set a vocabulary size of 32,768, a multiple of 8, as recommended for some GPU architectures. For the 7B tokenizer, we increase the vocabulary size to 51,200 to account for the inclusion of code data. We also train a BPE tokenizer7 on 4GB of English text, 4GB of Italian text, and 1GB of code. The text data is sampled from the training mix of datasets for the 7B, as reported in Table 1.

6 https://github.com/google/sentencepiece
7 https://huggingface.co/docs/tokenizers/en/api/trainers
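A minimal sketch of training a BPE tokenizer with SentencePiece and byte fallback, along the lines described above, is shown below. The input file names and the sampling of the training mix are placeholders; the exact training corpora are those listed in Table 1.

```python
# Sketch: training a balanced Italian/English BPE tokenizer with byte fallback.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="italian_4gb.txt,english_4gb.txt",   # placeholder files with the balanced mix
    model_prefix="minerva_tokenizer",
    model_type="bpe",
    vocab_size=32768,          # multiple of 8
    byte_fallback=True,        # unknown characters fall back to byte tokens
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="minerva_tokenizer.model")
print(sp.encode("Una frase di esempio per il tokenizer.", out_type=str))
```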
D. Model

The Minerva LLM family consists of four models, each sharing the same underlying architecture, i.e., that of Mistral-7B. The models are differentiated by their size, ranging from the 350 million parameters of Minerva-350M to the 7 billion parameters of the largest model, Minerva-7B. The Minerva family also includes Minerva-1B and Minerva-3B, with 1 billion and 3 billion parameters, respectively. More specifically, the Minerva-7B model is based directly on the Mistral-7B architecture, with the sole modifications being the vocabulary size, which we increase to 51,200 tokens, and the context length, which is set to 4,096 tokens without activating the sliding window attention feature. Hence, Minerva-7B is structured as a decoder-only transformer model comprising 32 layers. Each layer includes 32 attention heads, where each key-value pair is shared among four queries. Additionally, the model features feed-forward layers with a hidden size of 4096 and an intermediate size of 14336, which is 3.5 times the hidden size. Minerva-3B is a scaled-down version of Minerva-7B, and it shares similar features with Mistral-7B, including a maximum context length of 16,384 tokens, sliding window attention spanning 2,048 tokens, and a vocabulary size of 32,768 tokens. To achieve approximately 3 billion parameters, we have reduced the hidden size to 2560 and the intermediate size to 8960. Minerva-1B and Minerva-350M differ from their larger counterparts in several key respects. Both models have 16 attention heads, in contrast to the higher count in the larger models. Additionally, the hidden and intermediate sizes of the feed-forward layers are reduced further: Minerva-1B features a hidden size of 2048 and an intermediate size of 7168, while Minerva-350M has a hidden size of 1152 and an intermediate size of 4032. The complete list of parameters is reported in Table 3.

[Figure 2: Domain word count distribution for Italian CulturaX, showing the top-50 source URL domains by percentage of the total 117B words; the most represented domains include docplayer.it, radioradicale.it, tripadvisor.it, and it.wikipedia.org.]

Table 7. Training configuration for the various Minerva models.

Model | Optimizer | lr | betas | eps | Weight Decay | Scheduler | Warm-up | Batch Size | Steps
Minerva-350M | AdamW | 2 × 10^-4 | (0.9, 0.95) | 10^-8 | 0.0 | Cosine | 2% | 4M | 16,690
Minerva-1B | AdamW | 2 × 10^-4 | (0.9, 0.95) | 10^-8 | 0.0 | Cosine | 2% | 4M | 47,684
Minerva-3B | AdamW | 2 × 10^-4 | (0.9, 0.95) | 10^-8 | 0.0 | Cosine | 2% | 4M | 157,357
Minerva-7B | AdamW | 3 × 10^-4 | (0.9, 0.95) | 10^-5 | 0.1 | Cosine | 2000 steps | 4M | 591,558

E. News Summarization

Additional results. Table 8 reports the full results of our evaluation on news summarization.

Additional details on the experimental setup. To fine-tune our Minerva models we relied on the SFTTrainer class.8 The hyperparameters we used are reported in Table 9. We sought to be in line with the decisions taken in [43]. We also tried out different combinations, but we noticed that the best evaluation scores are given by the reported parameters. Furthermore, we want to highlight that Minerva-350M and Minerva-1B were fine-tuned using the AdamW optimizer [29]. Minerva-3B was trained using AdamW_Paged_32bit, a lighter version of AdamW, which allows a larger batch size to be used during training.

8 https://huggingface.co/docs/trl/en/sft_trainer
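The sketch below outlines a supervised fine-tuning run with trl's SFTTrainer, roughly following the hyperparameters in Table 9. The dataset id, its field names, the prompt template, and the model id are assumptions, and the SFTTrainer constructor arguments vary across trl versions, so this should be read as an illustration rather than the exact setup.

```python
# Sketch: fine-tuning a Minerva model for news summarization with trl's SFTTrainer.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Assumption: "ARTeLab/fanpage" with "source"/"target" fields is one of the two news corpora.
dataset = load_dataset("ARTeLab/fanpage", split="train")

def to_text(example):
    return {"text": f"Articolo: {example['source']}\nRiassunto: {example['target']}"}

dataset = dataset.map(to_text)

args = TrainingArguments(
    output_dir="minerva-summarization",
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    weight_decay=5e-3,
    warmup_ratio=0.2,
    lr_scheduler_type="linear",
    num_train_epochs=7,
)

trainer = SFTTrainer(
    model="sapienzanlp/Minerva-350M-base-v1.0",   # illustrative id
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",                    # argument name depends on trl version
)
trainer.train()
```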
Table 8. ROUGE metrics for news summarization fine-tuning.

Model | R1 ↑ | R2 ↑ | RL ↑
mBART Large | 0.32 | 0.15 | 0.25
mT5 Small | 0.34 | 0.16 | 0.26
mT5 Base | 0.33 | 0.16 | 0.26
IT5 Small | 0.35 | 0.17 | 0.28
IT5 EL32 | 0.34 | 0.16 | 0.26
IT5 Base | 0.25 | 0.10 | 0.20
IT5 Large | 0.38 | 0.19 | 0.29
Minerva-350M | 0.35 | 0.17 | 0.27
Minerva-1B | 0.35 | 0.17 | 0.27
Minerva-3B | 0.39 | 0.20 | 0.30

Table 9. Hyperparameters used to fine-tune our models.

Parameter | Value
warmup ratio | 0.2
weight decay | 5 × 10^-3
batch size | 64
optimizer | AdamW / PagedAdamW 32bit (only 3B)
learning rate | 0.0005
scheduler | Linear
epochs | 7

[Figure 3: Domain word count distribution for English CulturaX, showing the top-50 source URL domains by percentage of the total 2096B words; the most represented domains include google.com, issuu.com, scribd.com, and en.wikipedia.org.]

F. Few-shot Machine Translation

Here, we provide more details on our experimental setup for the Machine Translation task. In our experiments, we test the capability of a base model (i.e., with no instruction fine-tuning or task-specific fine-tuning) to translate a sentence from English to Italian and vice versa. Previously, LLMs have been shown to perform well in machine translation, and they now rival task-specific MT systems on a number of benchmarks [50] and tasks [51]. In our case, we prompt the language models by providing a set of 5 randomly sampled English-to-Italian translations (and vice versa for the Italian-to-English direction). Finally, we measure the translation performance of the models using COMET, a learned metric that assesses the quality of an automatic translation against a gold reference, as COMET has shown better correlation with human judgement than other metrics, such as BLEU.
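For completeness, the sketch below shows one way to compute reference-based COMET scores with Unbabel's `comet` library; the checkpoint choice and the example triple are illustrative and not necessarily the exact configuration used in our evaluation.

```python
# Sketch: scoring a translation against a gold reference with a COMET checkpoint.
from comet import download_model, load_from_checkpoint

ckpt = download_model("Unbabel/wmt22-comet-da")   # illustrative reference-based checkpoint
comet = load_from_checkpoint(ckpt)

data = [{
    "src": "The book is on the table.",     # source sentence
    "mt":  "Il libro è sul tavolo.",        # model translation
    "ref": "Il libro è sopra il tavolo.",   # gold reference
}]
scores = comet.predict(data, batch_size=8, gpus=0)
print(scores.system_score)
```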