<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Riccardo</forename><surname>Orlando</surname></persName>
							<email>orlando@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Dipartimento di Ingegneria Informatica</orgName>
								<orgName type="department" key="dep2">Automatica e Gestionale</orgName>
								<orgName type="laboratory">Sapienza NLP Group</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luca</forename><surname>Moroni</surname></persName>
							<email>moroni@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Dipartimento di Ingegneria Informatica</orgName>
								<orgName type="department" key="dep2">Automatica e Gestionale</orgName>
								<orgName type="laboratory">Sapienza NLP Group</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pere-Lluís</forename><surname>Huguet Cabot</surname></persName>
							<email>huguetcabot@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Dipartimento di Ingegneria Informatica</orgName>
								<orgName type="department" key="dep2">Automatica e Gestionale</orgName>
								<orgName type="laboratory">Sapienza NLP Group</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Edoardo</forename><surname>Barba</surname></persName>
							<email>barba@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Dipartimento di Ingegneria Informatica</orgName>
								<orgName type="department" key="dep2">Automatica e Gestionale</orgName>
								<orgName type="laboratory">Sapienza NLP Group</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simone</forename><surname>Conia</surname></persName>
							<email>conia@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Dipartimento di Ingegneria Informatica</orgName>
								<orgName type="department" key="dep2">Automatica e Gestionale</orgName>
								<orgName type="laboratory">Sapienza NLP Group</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sergio</forename><surname>Orlandini</surname></persName>
							<email>s.orlandini@cineca.it</email>
							<affiliation key="aff1">
								<orgName type="institution">CINECA</orgName>
								<address>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Fiameni</surname></persName>
							<email>gfiameni@nvidia.com</email>
							<affiliation key="aff2">
								<orgName type="institution">NVIDIA</orgName>
								<address>
									<settlement>Santa Clara</settlement>
									<region>California</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
							<email>navigli@diag.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Dipartimento di Ingegneria Informatica</orgName>
								<orgName type="department" key="dep2">Automatica e Gestionale</orgName>
								<orgName type="laboratory">Sapienza NLP Group</orgName>
								<orgName type="institution">Sapienza University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">80F99ADF63E2489B537AA61E179BCA87</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Language Modeling</term>
					<term>Italian Language</term>
					<term>LLM Pretraining</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model's vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva's development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) have revolutionized the way Natural Language Processing (NLP) tasks are approached, achieving remarkable results in existing areas and opening the door to entirely new research directions and applications. As a result, the energy and resources dedicated to the study and creation of LLMs are growing exponentially. However, most LLMs -both closed- and open-source -are predominantly designed for English, posing significant challenges and limitations for their use in non-English settings. In practice, generating Italian text using multilingual or language-adapted English models, e.g., from Mistral <ref type="bibr" target="#b0">[1]</ref> or Llama <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, is computationally more expensive and often less effective compared to using a model specifically designed for the Italian language. This inefficiency stems from the vocabulary of an English or multilingual LLM -i.e., the lexical units, or tokens, that the model can use to compose text -when it is not optimized for the Italian language, resulting in Italian words being split into an excessive number of tokens. Consequently, this creates longer sequences of tokens, slower generation times, and higher computational costs, especially since many popular attention mechanisms have a quadratic complexity with respect to sequence length.</p><p>Efforts to create language-specific LLMs are increasing, and fall primarily into two main categories: i) adapting existing English-centric LLMs to other languages, and ii) training LLMs from scratch. The advantages of adapting existing English-centric LLMs to other languages are enticing: starting with a proven model can reduce the computational requirements, and adaptation can be achieved with relatively modest amounts of data. 
There are several language adaptation techniques, which range from fine-tuning the model on data for the target language <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> to modifying the model's architecture <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>, making these techniques flexible for different budgets and objectives. However, these techniques may not fully capture language-specific nuances and can degrade the performance in the original language, indeed an undesirable effect. Alternatively, training LLMs from scratch provides the freedom to make design choices tailored to the linguistic features of the target language-including morphology, lexicon, syntax, and semantics-which are often overlooked in English-centric models <ref type="bibr" target="#b8">[9]</ref>. It also allows for incorporating culturally relevant content, reducing biases that might be present in models primarily trained on English data, thus leading to more inclusive and accurate representations of language use. Unfortunately, while there are several efforts on adapting English-centric LLMs to the Italian language, e.g., Llamantino-2 <ref type="bibr" target="#b3">[4]</ref>, Llamantino-3 <ref type="bibr" target="#b4">[5]</ref>, DanteLLM <ref type="bibr" target="#b9">[10]</ref>, and Camoscio <ref type="bibr" target="#b10">[11]</ref>, inter alia, there is no truly open-source endeavor exploring what can be achieved by training an LLM from scratch on Italian data.</p><p>With this work, we follow the latter path and introduce Minerva, the first family of LLMs designed specifically for the Italian language and pretrained on Italian text. 
<ref type="foot" target="#foot_0">1</ref>We present the design choices for our models, our data processing, and the evaluation results regarding our Minerva LLMs, showing that our models -with 350M, 1B, 3B, and 7B parameters -outperform comparable multilingual models and even rival larger models adapted for Italian. We conclude with a discussion on the benefits and challenges of pretraining LLMs from scratch for the Italian language, sharing our experience and findings to provide valuable insights for the academic and industrial communities interested in training non-English LLMs from scratch. Lastly, we describe the technical details of Minerva-7B, our latest model with 7.4 billion parameters, for which we share our initial results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Building a Pretraining Dataset for Italian LLMs</head><p>The field of LLMs is growing at an astonishing pace, with new models, datasets, benchmarks, and techniques presented every week. However, over the past few months, academic and industrial researchers have increasingly recognized the fundamental role of the data used to pretrain LLMs. Unsurprisingly, the majority of the leading companies are not releasing their training data as they seek to maintain an advantage over the competition, with very few exceptions (e.g. OLMo by AllenAI <ref type="bibr" target="#b11">[12]</ref> and OpenELM by Apple <ref type="bibr" target="#b12">[13]</ref>). In this section, we describe the different sources of data used in the training of the Minerva models, and Table <ref type="table" target="#tab_0">1</ref> provides an overview of these (cf. Appendix A for more details). Most importantly, the training datasets we used are entirely available online, making our process transparent and allowing researchers to better study the connection between pretraining data and model behavior.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data Sources</head><p>The training data for our Minerva models consists of three main categories: Italian, English, and code data. Datasets used to train Minerva with their languages (second column) and number of tokens (third to sixth columns).</p><p>We only use the code data to train our largest model, i.e., Minerva-7B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">Italian Data</head><p>Web data. The majority of the text used to train LLMs is sourced from Web-scraped data, typically from CommonCrawl (CC). Therefore, a significant portion of Italian text included in our training datasets is also of this nature, inherently exposing our models to potential biases and toxic content commonly found on the Web. Because preprocessing techniques, such as language identification, perplexity filtering, deduplication, and content classification are computationally expensive, the most sensible choice is thus to rely on preprocessed collections, such as CulturaX <ref type="bibr" target="#b13">[14]</ref> and RedPajama v2 <ref type="bibr" target="#b14">[15]</ref>. These collections already include Italian data, and have undergone various levels of filtering and deduplication, as discussed in Section 2.2.</p><p>Curated data. While Penedo et al. <ref type="bibr" target="#b15">[16]</ref> suggest that high-quality Web data is sufficient on its own to train LLMs, curated data sources are often used to further improve the model performance and introduce a broader diversity of data types, such as encyclopedic and academic text <ref type="bibr" target="#b16">[17]</ref>, as well as scientific and math-related text. Therefore, we include curated texts from several sources, including Wikipedia (encyclopedic/world knowledge data), EurLex and Gazzetta Ufficiale (law, economics, and politics), and the Gutenberg Project (novels, poetry, etc.).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">English Data</head><p>Web data. Mirroring our approach with the Italian data, we use preprocessed collections of English data from the Web. Given that English is the most popular language on the Internet and has been the primary focus of LLM research, there are numerous options that already provide a large amount of tokens from filtered, deduplicated, and cleaned sources. For our Minerva-350M, 1B, and 3B models, we collect data from the English partition of CulturaX, capping the number of tokens to the same amount as the Italian ones, as shown in Table <ref type="table" target="#tab_0">1</ref>. Instead, to train Minerva-7B, we use a portion of FineWeb <ref type="bibr" target="#b17">[18]</ref>, which includes filtered and deduplicated CC dumps with various timestamps. Specifically, we use the CC dumps from 2023-14 to 2024-18 to match the total number of tokens in the Italian Web partition of our training data.</p><p>Curated sources. We include the 5.3B tokens from the English Wikipedia and 7B tokens from the copyright-free books in Project Gutenberg. Additionally, we include data from arXiv and StackExchange, which are included in the RedPajama dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.">Code Data</head><p>Previous work has highlighted the importance of including source code in the pretraining corpus of an LLM, in order to improve not only its code understanding and generation, but also its general reasoning capabilities <ref type="bibr" target="#b18">[19]</ref> even for tasks that do not directly involve or require programming. Therefore, for our largest model -Minerva-7B -we also include a portion of code data. More specifically, we extract 200B tokens from The Stack V2 <ref type="bibr" target="#b19">[20]</ref>, selecting the data from their deduplicated partition, which includes 17 of the most popular programming languages on GitHub.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Preprocessing</head><p>As mentioned above, our preprocessing effort remains minimal, as we rely on the preprocessing pipelines used in CulturaX, RedPajama, and FineWeb. To evaluate the content and quality of our training data, we employ the methodology described in Elazar et al. <ref type="bibr" target="#b20">[21]</ref> to analyze the URL domain distribution within the Italian partition of CulturaX and RedPajama, as these partitions had never been utilized in training an LLM prior to Minerva. We provide an overview of our analysis together with a few insights in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Data Filtering and Deduplication</head><p>Previous work on English-centric LLMs <ref type="bibr" target="#b21">[22]</ref> has already emphasized the importance of training LLMs on "clean" data. Two of the most important parts of data cleaning are filtering, i.e., removing content that does not satisfy a set of criteria, and deduplication, i.e., removing portions of text that appear too often so as to minimize memorization.</p><p>As mentioned above, for the corpus used to train the Minerva models, we rely mainly on collections of data that has already been filtered and deduplicated. However, there are some minor considerations that depend on each collection of data. More specifically, we use CulturaX as-is, relying on their filtering and deduplication pipeline. Unfortunately, RedPajama v2 is not filtered and deduplicated; however, its data is tagged with meta-information that can be used to apply filtering and deduplication. Such metadata includes, for example, the perplexity score of each text computed via a language model trained on Wikipedia, which is used to partition RedPajama v2 into three partitions: head, middle, tail. For our training corpus, we only include a document if it is classified as head or middle according to its perplexity score. Moreover, we use the precomputed metadata to remove exact duplicates and apply fuzzy deduplication. The latter is performed by using the hash provided for each document with Locality Sensitive Hashing and Jaccard similarity 0.7 to decide whether two documents are fuzzy duplicates. Note that we only apply fuzzy deduplication within each CC dump, rather than across all the dumps. 
This decision is motivated by two observations: first, applying fuzzy deduplication across all CC dumps is computationally expensive; second, previous work <ref type="bibr" target="#b17">[18]</ref> has shown that per-CC deduplication is not only sufficient, but is also beneficial, when training English LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Minerva LLMs</head><p>In this section, we provide an overview of the Minerva LLMs: we describe their tokenizers, the design choices behind the model architecture, and how we trained the resulting LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Vocabulary and Tokenizers</head><p>The vocabulary of an LLM is mainly impacted by its size, i.e., the number of tokens in the vocabulary itself, and how the tokenizer is trained, i.e., which tokens make up the vocabulary. These two factors impact the fertility of the resulting tokenizer, which measures the average number of tokens (subwords) into which a word is split. Tokenizers with lower fertility are preferable, as the input and output sequences they produce are shorter, resulting in an efficiency gain, especially as most attention mechanisms are quadratic with respect to the sequence length. Unsurprisingly, the vocabulary allocation of an English-centric LLM minimizes the fertility of English text, and results in high fertility values for Italian text, as shown in Table <ref type="table" target="#tab_2">2</ref>.  Given the importance for our Minerva LLMs of having a low fertility on Italian text, we intentionally train the Minerva tokenizer on a balanced mix of English and Italian data (and code data for the 7B model). Our analysis shows that this strategy leads to a much improved fertility on Italian data, while at the same time maintaining similar fertility on English data. More specifically, for Minerva-350M/1B/3B, we opted for a vocabulary size similar to that of Mistral-7B (around 32k tokens): in this case, the fertility of the Minerva tokenizer is ~20% better than the Mistral tokenizer on the Italian Wikipedia and only ~1% worse on the English Wikipedia. Following recent trends in LLMs, for Minerva-7B, we increased the vocabulary size to around 50k tokens, which resulted in a further fertility improvement of ~6% and ~5% on the Italian and English Wikipedias, respectively, notwithstanding the addition of code data to the training data. We provide more details on the tokenizer in Appendix C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Model Architecture</head><p>While the field of LLMs is moving rapidly, one of the best models when our efforts started was Mistral. Therefore, our Minerva LLMs are based on Mistral's model architecture. The Minerva LLMs are, therefore, a family of decoder-only transformer models, with a few standout features, such as grouped-query attention (GQA) <ref type="bibr" target="#b22">[23]</ref>, which boosts inference speed and reduces memory requirements for increased throughput, and sliding window attention (SWA) <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25]</ref>, which manages longer sequences more efficiently at reduced computational costs. Specifically, the GQA is configured to share one key-value pair every four queries, while the SWA configuration handles up to 2,048 tokens with a maximum context length of 16,384 tokens. We build four models with different sizes by scaling the number of attention heads, hidden size, intermediate size, and hidden layers, while maintaining a ratio of ~3.5 between the hidden size and intermediate size, as in the original Mistral model. However, following the more recent model releases by Mistral, Minerva-7B does not use SWA. Instead, it implements full attention across its entire context length, which can extend up to 4096 tokens, i.e., double the number of tokens for the SWA used in Minerva-350M/1B/3B. The parameters for each model size are detailed in Table <ref type="table" target="#tab_4">3</ref>, for which we provide a more in-depth description in Appendix D.</p><p>Building Minerva on top of Mistral's model architecture also brings other benefits, such as broad compatibility with the ecosystem of libraries, frameworks, and tools that has emerged over recent months, including llama.cpp <ref type="bibr" target="#b25">[26]</ref>, FlashAttention <ref type="bibr" target="#b26">[27]</ref>, and vLLM <ref type="bibr" target="#b27">[28]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Model Training</head><p>We train all the Minerva LLMs using MosaicML's LLM Foundry. <ref type="foot" target="#foot_1">2</ref> The training process is conducted on the Leonardo Supercomputer<ref type="foot" target="#foot_2">3</ref> hosted and maintained by CINECA. Each node in Leonardo is equipped with 4 × custom NVIDIA A100 SXM4 with 64GB of VRAM.</p><p>All our models are trained using the AdamW optimizer <ref type="bibr" target="#b28">[29]</ref> with 𝛽 1 = 0.9, 𝛽 2 = 0.95, 𝑒𝑝𝑠 = 10 −8 (with the only exception being Minerva-7B, which is trained using 𝑒𝑝𝑠 = 10 −5 ) on a standard causal language modeling training objective. To smooth the training process, we follow standard practice in the literature and employ a warmup-then-cooldown learning rate scheduling. More specifically, we first increase the learning rate linearly during the initial training phase (2% of the total number of training steps for Minerva-350M/1B/3B and 0.3% for Minerva-7B) until the peak learning rate is reached (2×10 −4 for Minerva-350M/1B/3B, 3×10 −4 for Minerva-7B), and then decrease the learning rate with a cosine scheduling until the end of the training process. The hyperparameters used for each model are shown in Table <ref type="table">7</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>We measure the 0-shot performance of our Minerva LLMs on ITA-Bench <ref type="bibr" target="#b29">[30]</ref>, a suite of benchmarks that have been created either by translating existing benchmarks from other languages, or by adapting existing Italian benchmarks so that they can be used for LLM evaluation. ITA-Bench includes a set of 10 benchmarks commonly used to evaluate LLMs, namely, ARC Challenge (ARC-C), ARC Easy (ARC-E) <ref type="bibr" target="#b30">[31]</ref>, BoolQ <ref type="bibr" target="#b31">[32]</ref>, GSM8K <ref type="bibr" target="#b32">[33]</ref>, HellaSwag (HS) <ref type="bibr" target="#b33">[34]</ref>, MMLU <ref type="bibr" target="#b34">[35]</ref>, PIQA <ref type="bibr" target="#b35">[36]</ref>, SciQ <ref type="bibr" target="#b36">[37]</ref>, TruthfulQA <ref type="bibr" target="#b37">[38]</ref>, and Winogrande (WG) <ref type="bibr" target="#b38">[39]</ref>. Overall, these benchmarks offer a comprehensive view of the capabilities of an LLM on a wide variety of aspects, including scientific knowledge, world knowledge (e.g., geography, politics, economics), commonsense knowledge, physical  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Zero-shot evaluation results of the Minerva models on a set of standard benchmarks translated from English to Italian.</p><p>interactions, coreference, and math reasoning, among others. Employing automatically-translated benchmarks is far from ideal, but it allows us to better compare the scores obtained in Italian with those obtained in English, while we wait for the Italian research community to develop Italian-specific benchmarks <ref type="bibr" target="#b39">[40]</ref>.</p><p>As shown in Table <ref type="table">4</ref>, the average performance of the Minerva models increases steadily with the model size. For our 3B model, we also provide a comparison with two models of the same size: XGLM <ref type="bibr" target="#b40">[41]</ref>, a multilingual LLM by META, and OpenELM <ref type="bibr" target="#b41">[42]</ref>, a very recent English-only model developed by Apple. Our evaluation shows that Minerva-3B outperforms XGLM and OpenELM by a significant margin, i.e., +4.4% and +3.7% on average.</p><p>Finally, Minerva-7B achieves the highest performance among the Minerva LLMs family, as expected. Notably, Minerva-7B achieves a higher average score than Llamantino-2. This is an interesting comparison because the pretraining data for Llama-2, i.e., the pretrained LLM used to build Llamantino-2, is not available and has never been disclosed, making the model open-weights but not entirely open-source. 4 When compared to closed-sourced LLMs such as Mistral-7B-v0.1 or Llama-3.1-8B, Minerva still lags behind in some tasks, such as BoolQ or GSM8K, which may require better reasoning capabilities and/or more pretraining data. 
As we can observe from Figure <ref type="figure" target="#fig_0">1</ref>, which tracks the progress of Minerva-7B 4</p><p>We stress that, for Llamantino-2, only the data that has been used for the language adaptation process is available, whereas the pretraining data is not.</p><p>on ITA-Bench every 10,000 training steps, the model is still slowly improving towards the end of the pretraining phase, suggesting that a larger training corpus or multiple epochs may be beneficial in future developments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Downstream tasks</head><p>In this section, we show the results of the Minerva models when adapted to two downstream applications. This analysis is particularly relevant for Minerva-350M and Minerva-1B, which can be utilized for specific tasks rather than as general-purpose models, offering lower computational costs. The tasks in this analysis include: i) Italian Abstractive News Summarization, and ii) Machine Translation, in both directions (IT-EN and EN-IT).</p><p>News Summarization. Following Sarti and Nissim <ref type="bibr" target="#b42">[43]</ref>, we fine-tune Minerva models (up to 3B) on a concatenation of two Italian news summarization datasets: Fanpage.it and Il Post newspapers <ref type="bibr" target="#b43">[44]</ref>. A detailed overview of the hyperparameters used to train our models is provided in Appendix E. We find that Minerva-3B obtains the best results (0.30 vs 0.29 of the second best in terms of Rouge-L); however, it is not as parameter-efficient as IT5-Large, probably because encoder-decoder models are more suitable for fine-tuning than decoder-only models <ref type="bibr" target="#b44">[45]</ref>. In Table <ref type="table">8</ref>, we report the full results of Minerva fine-tuned on the aforementioned datasets and compared to baselines in Sarti and Nissim <ref type="bibr" target="#b42">[43]</ref>.</p><p>Machine Translation. We also evaluate our Minerva LLMs in few-shot <ref type="bibr" target="#b45">[46]</ref> machine translation on two benchmarks, FLORES <ref type="bibr" target="#b46">[47]</ref> and OPUS-100 <ref type="bibr" target="#b47">[48]</ref>. We explore how LLMs perform this task relying only on in-context-learning few-shot examples, reporting our results with 5-shot prompting. 
We rely on the vLLM library <ref type="bibr" target="#b27">[28]</ref> and change the default parameters with temperature=0 and max_tokens=512.</p><p>We highlight that Minerva-3B reaches competitive results in MT in both EN-IT (84.8 on Flores and 76.7 on Opus in terms of COMET score) and IT-EN (85.7 and 78.0). Compared with other models of similar size, Minerva-3B shows strong results when the target language is Italian (+1.7 and +2.7 compared to Gemma-2B and Qwen-1.5B on Opus). Minerva-7B further showcases this by achieving the highest performance among models tested when translating from English into Italian. The full results are reported in Table <ref type="table" target="#tab_6">5</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>In this paper, we demonstrated the feasibility and benefits of pretraining Italian language models from scratch, which not only improves the computational efficiency and performance of an LLM for a target language but also reduces linguistic biases inherited from English training corpora <ref type="bibr" target="#b48">[49]</ref>. The Minerva models (https://nlp.uniroma1.it/minerva) showcase promising results on a variety of Italian benchmarks and downstream tasks, including news summarization and machine translation. Most importantly, we describe, for the first time, the process of creating an Italian pretraining corpus with more than 1T tokens, and we share findings and insights into the pretraining process of Italian LLMs with the academic and industrial communities, paving the way for future research in training non-English language models. We hope that our contributions will represent a stepping stone for future work on language-specific and multilingual large-scale language modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Detailed breakdown of each dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Dataset Insights</head><p>We leveraged the WIMBD<ref type="foot" target="#foot_3">5</ref> library to compute word counts per URL domain on CulturaX. We decided not to do this for RedPajama v2 or FineWeb as their original data already provides token count and other insights into the dataset distribution. Figures <ref type="figure">2 and 3</ref> show the aggregation of word counts per domain for Italian and English, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Tokenizer</head><p>We trained two tokenizers for Minerva. The first one is shared by the three smaller sizes, 350M, 1B and 3B. It is trained on a mix of 4GB of Italian text data and 4GB of English text data, both from CulturaX. Our objective is to have a balanced vocabulary across the two languages, mirroring the training data. We use the SentencePiece library <ref type="foot" target="#foot_4">6</ref> to train a BPE tokenizer and we apply byte fallback. We set a vocabulary size of 32,768 as a multiple of 8, which is recommended by some GPU architectures.</p><p>For the 7B tokenizer, we increase the vocabulary size to account for the inclusion of code data, up to 51,200. We also train a BPE tokenizer <ref type="foot" target="#foot_5">7</ref> with 4GB of English text, 4GB of Italian and 1GB of code. The text data is sampled from the training mix of datasets for the 7B, as reported in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Model</head><p>The Minerva LLM family consists of four models, each sharing the same underlying architecture, i.e., that of Mistral-7B. The models are differentiated by their size, ranging from 350 million parameters of Minerva-350M to 7 billion parameters of the largest model, Minerva-7B. The Minerva family also includes Minerva-1B and Minerva-3B, with 1 billion and 3 billion parameters, respectively. More specifically, the Minerva-7B model is based directly on the Mistral-7B architecture, with the sole modifications being the vocabulary size, which we increase to 51,200 tokens, and the context length, which is set to 4,096 tokens without activating the sliding window attention feature. Hence, Minerva-7B is structured as a decoder-only transformer model, comprising 32 layers. Each layer includes 32 attention heads, where each key-value pair is shared among four queries. Additionally, the model features feed-forward layers with a hidden size of 4096 and an intermediate size of 14336, which is 3.5 times the hidden size. Minerva-3B is a scaled down version of Minerva-7B, and it shares similar features with Mistral-7B, including a maximum context length of 16,384 tokens, sliding window attention spanning 2,048 tokens, and a vocabulary size of 32,768 tokens. To achieve approximately 3 billion parameters, we have reduced the hidden size to 2560 and the intermediate size to 8960. Minerva-1B and Minerva-350M differ from their larger counterpart in several key respects. Both models have 16 attention heads, in contrast to the higher count in the larger model. Additionally, the hidden and intermediate sizes of the feed-forward layers are reduced further: Minerva-1B features a hidden size of 2048 and an intermediate size of 7168, while Minerva-350M has a hidden size of 1152 and an intermediate size of 4032. The complete list of parameters is reported in Table <ref type="table" target="#tab_4">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. News Summarization</head><p>Additional results. Table <ref type="table">8</ref> reports the full results of our evaluation on news summarization.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1:Tracking the progress of Minerva-7B during its pretraining process. Here, we report the average accuracy on ITA-Bench every 10,000 steps, i.e., every 40B tokens approximately.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Dataset</cell><cell></cell><cell cols="4">Minerva -Model Size</cell></row><row><cell>Name</cell><cell>Lang.</cell><cell>350M</cell><cell>1B</cell><cell>3B</cell><cell>7B</cell></row><row><cell>RedPajama-V2</cell><cell>Italian</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>894B</cell></row><row><cell>CulturaX</cell><cell>Italian</cell><cell>35B</cell><cell>100B</cell><cell>330B</cell><cell>237B</cell></row><row><cell>Wikipedia</cell><cell>Italian</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>1.3B</cell></row><row><cell>Gutenberg</cell><cell>Italian</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>0.15B</cell></row><row><cell>Wikisource</cell><cell>Italian</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>0.12B</cell></row><row><cell>EurLex</cell><cell>Italian</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>1.6B</cell></row><row><cell>Gazzetta Ufficiale</cell><cell>Italian</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>1.7B</cell></row><row><cell>FineWeb</cell><cell>English</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>1,076B</cell></row><row><cell>CulturaX</cell><cell>English</cell><cell>35B</cell><cell>100B</cell><cell>330B</cell><cell>-</cell></row><row><cell>Wikipedia</cell><cell>English</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>5.3B</cell></row><row><cell>ArXiv</cell><cell>English</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>33B</cell></row><row><cell>Gutenberg</cell><cell>English</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>7B</cell></row><row><cell>StackExchange</cell><cell>English</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>22B</cell></row><row><cell>The Stack V2</cell><cell>Code</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>201B</cell></row><row><cell>Total # of tokens</cell><cell></cell><cell>70B</cell><cell cols="2">200B 
660B</cell><cell>2.48T</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Fertility rates (lower is better) for Minerva tokenizers compared to other LLMs. The fertility rates are computed on a randomly sampled collection of texts from CulturaX and Wikipedia in both Italian (Ita) and English (Eng).</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Model Params Layers Hidden Size Inter. Size Att. Heads KV Heads SW Length Ctx. Length</head><label></label><figDesc></figDesc><table><row><cell>Minerva-350M</cell><cell>352M</cell><cell>16</cell><cell>1152</cell><cell>4032</cell><cell>16</cell><cell>4</cell><cell>2048</cell><cell>16,384</cell></row><row><cell>Minerva-1B</cell><cell>1.01B</cell><cell>16</cell><cell>2048</cell><cell>7168</cell><cell>16</cell><cell>4</cell><cell>2048</cell><cell>16,384</cell></row><row><cell>Minerva-3B</cell><cell>2.89B</cell><cell>32</cell><cell>2560</cell><cell>8960</cell><cell>32</cell><cell>8</cell><cell>2048</cell><cell>16,384</cell></row><row><cell>Minerva-7B</cell><cell>7.40B</cell><cell>32</cell><cell>4096</cell><cell>14336</cell><cell>32</cell><cell>8</cell><cell>None</cell><cell>4,096</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Overview of the main hyperparameters for our Minerva models. We include the number of parameters (approximately, 350M, 1B, 3B, and 7B) and the corresponding number of layers, hidden size, intermediate size, attention heads, key-value heads, sliding window length, and maximum context length.</figDesc><table><row><cell>Size Name</cell><cell cols="4">ARC-C ARC-E BoolQ GSM8K</cell><cell>HS</cell><cell cols="6">MMLU PIQA SciQ TQA WG AVG</cell></row><row><cell>0.4B Minerva-350M-base-v1.0</cell><cell>24.6</cell><cell>36.4</cell><cell>60.7</cell><cell>48.2</cell><cell>32.6</cell><cell>25.7</cell><cell>59.5</cell><cell>63.7</cell><cell>46.5</cell><cell>58.4</cell><cell>45.6</cell></row><row><cell>1B Minerva-1B-base-v1.0</cell><cell>26.6</cell><cell>42.2</cell><cell>57.1</cell><cell>49.7</cell><cell>39.6</cell><cell>27.0</cell><cell>62.9</cell><cell>73.5</cell><cell>44.6</cell><cell>60.0</cell><cell>48.3</cell></row><row><cell>3B OpenELM-3B</cell><cell>27.0</cell><cell>37.9</cell><cell>60.9</cell><cell>49.7</cell><cell>40.7</cell><cell>28.3</cell><cell>56.7</cell><cell>81.8</cell><cell>47.3</cell><cell>58.4</cell><cell>48.9</cell></row><row><cell>3B XGLM-2.9B</cell><cell>27.5</cell><cell>41.4</cell><cell>59.1</cell><cell>65.7</cell><cell>44.5</cell><cell>27.4</cell><cell>59.9</cell><cell>77.8</cell><cell>43.1</cell><cell>60.2</cell><cell>50.6</cell></row><row><cell>3B Minerva-3B-base-v1.0</cell><cell>31.4</cell><cell>49.1</cell><cell>62.1</cell><cell>55.8</cell><cell>52.9</cell><cell>29.2</cell><cell>66.9</cell><cell>79.9</cell><cell>41.4</cell><cell>62.2</cell><cell>53.1</cell></row><row><cell>7B OLMo-7B-0724-hf</cell><cell>30.7</cell><cell>44.0</cell><cell>72.9</cell><cell>52.5</cell><cell>47.9</cell><cell>30.9</cell><cell>58.7</cell><cell>85.1</cell><cell>44.6</cell><cell>61.2</cell><cell>52.8</cell></row><row><cell>7B 
LLaMAntino-2-7b</cell><cell>33.7</cell><cell>50.8</cell><cell>70.9</cell><cell>52.2</cell><cell>54.9</cell><cell>33.8</cell><cell>64.4</cell><cell>86.1</cell><cell>44.3</cell><cell>64.1</cell><cell>55.5</cell></row><row><cell>7B Minerva-7B-base-v1.0</cell><cell>42.0</cell><cell>68.8</cell><cell>79.5</cell><cell>50.0</cell><cell>62.6</cell><cell>36.2</cell><cell>69.8</cell><cell>87.7</cell><cell>38.5</cell><cell>65.0</cell><cell>60.0</cell></row><row><cell>7B Mistral-7B-v0.1</cell><cell>42.8</cell><cell>61.3</cell><cell>78.2</cell><cell>56.1</cell><cell>60.4</cell><cell>38.0</cell><cell>65.5</cell><cell>90.8</cell><cell>43.5</cell><cell>68.8</cell><cell>60.5</cell></row><row><cell>8B Llama-3.1-8B</cell><cell>44.0</cell><cell>61.1</cell><cell>78.0</cell><cell>57.8</cell><cell>62.9</cell><cell>38.7</cell><cell>67.7</cell><cell>90.3</cell><cell>43.0</cell><cell>69.2</cell><cell>61.3</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5</head><label>5</label><figDesc>COMET scores measure the translation capabilities of our Minerva models and other LLMs on the FLORES and OPUS datasets. This evaluation is conducted in a 5-shot setting, where each model receives five random translation examples from the development set before the test instance.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://nlp.uniroma1.it/minerva</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/mosaicml/llm-foundry</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://leonardo-supercomputer.cineca.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://github.com/allenai/wimbd</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://github.com/google/sentencepiece</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://huggingface.co/docs/tokenizers/en/api/trainers</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Edoardo Barba, Simone Conia and Pere-Lluís Huguet Cabot are fully funded by the PNRR MUR project PE0000013-FAIR. Roberto Navigli acknowledges the support of the CREATIVE PRIN project. The authors acknowledge the CINECA award IsB28_medit under the ISCRA initiative for the availability of high-performance computing resources and support.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Code Code</head><p>https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Data sources</head><p>Table <ref type="table">6</ref> shows the source of each dataset used to train Minerva in its different sizes. The Tokens column shows the total number of tokens we used from each dataset. Where Table <ref type="table">1</ref> shows more tokens used for training, it means they were resampled from the total in order to reach that number. All these datasets are openly licensed. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 7</head><p>Training configuration for various Minerva models.</p><p>Additional details on the experimental setup. To finetune our Minerva models we relied on the SFTTrainer class. 8 The hyperparameters we used are reported in Table <ref type="table">9</ref>. We sought to be in-line with the decisions taken in <ref type="bibr" target="#b42">[43]</ref>. We also tried out different combinations, but we noticed that the best evaluation scores are given by the  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Few-shot Machine Translation</head><p>Here, we provide more details on our experimental setup for the Machine Translation task. In our experiments, we test the capability of a base model (i.e., with no instruction fine-tuning or task-specific fine-tuning) to translate a sentence from English to Italian and vice versa. Previously, LLMs have been shown to perform well in machine translation and they now rival task-specific MT systems on a number of benchmarks <ref type="bibr" target="#b49">[50]</ref> and tasks <ref type="bibr" target="#b50">[51]</ref>. In our case, we prompt the language models by providing a set of 5 randomly sampled English-to-Italian translations (and vice-versa for the Italian-to-English translation). Finally, we measure the translation performance of the models using COMET, a learned metric to assess the quality between an automatic translation and a gold reference, as COMET has shown better correlation with human judgement than other metrics, such as BLEU. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 9</head><p>Hyper-parameters used to fine-tune our models.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>De Las Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Lavaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">E</forename><surname>Sayed</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2310.06825.arXiv:2310.06825" />
	</analytic>
	<monogr>
		<title level="j">Mistral</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2302.13971.arXiv:2302.13971" />
		<title level="m">Llama: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2307.09288.arXiv:2307.09288" />
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Llamantino: Llama 2 models for effective text generation in italian language</title>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fiameni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2312.09993.arXiv:2312.09993" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.07101</idno>
		<title level="m">Advanced natural-based interaction for the italian language: Llamantino-3-anita</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Efficient language model training through cross-lingual and progressive transfer learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ostendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rehm</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.09626</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">FOCUS: Effective embedding initialization for monolingual specialization of multilingual models</title>
		<author>
			<persName><forename type="first">K</forename><surname>Dobler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Melo</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.emnlp-main.829</idno>
		<ptr target="https://aclanthology.org/2023.emnlp-main.829.doi:10.18653/v1/2023.emnlp-main.829" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="13440" to="13454" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Csaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pawakapan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Thakker</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.05829</idno>
		<title level="m">Sambalingo: Teaching large language models new languages</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Faysse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Guerreiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Loison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Corro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boizard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Martins</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.00786</idno>
		<title level="m">Croissantllm: A truly bilingual french-english language model</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">DanteLLM: Let&apos;s push Italian LLM research forward!</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bacciu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Campagnano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trappolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silvestri</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.388" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="4343" to="4355" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Santilli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rodolà</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.16456</idno>
		<title level="m">Camoscio: an italian instruction-tuned llama</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Groeneveld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Walsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bhagia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tafjord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Jha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ivison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Magnusson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Atkinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Authur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Chandu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dumas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Elazar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hessel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Merrill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Morrison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Naik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Nam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pyatkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ravichander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Strubell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Subramani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dasigi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dodge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Soldaini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.00838.arXiv:2402.00838" />
		<title level="m">Olmo: Accelerating the science of language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Sekhavat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Horton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mirzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najibi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Belenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zatloukal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rastegari</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2404.14619.arXiv:2404.14619" />
		<title level="m">Openelm: An efficient language model family with open training and inference framework</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">V</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">D</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Man</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">T</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dernoncourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">H</forename><surname>Nguyen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.09400</idno>
		<title level="m">Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Computer</surname></persName>
		</author>
		<ptr target="https://github.com/togethercomputer/RedPajama-Data" />
		<title level="m">Redpajama: an open dataset for training large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Penedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Malartic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hesslow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cojocaru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cappelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Alobeidli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pannier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Almazrouei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Launay</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.01116</idno>
		<ptr target="https://arxiv.org/abs/2306.01116.arXiv:2306.01116" />
		<title level="m">The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Croissantllm: A truly bilingual french-english language model</title>
		<author>
			<persName><forename type="first">M</forename><surname>Faysse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Guerreiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Loison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Corro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boizard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Casademunt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yvon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Viaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hudelot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Colombo</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.00786.arXiv:2402.00786" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">The fineweb datasets: Decanting the web for the finest text data at scale</title>
		<author>
			<persName><forename type="first">G</forename><surname>Penedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kydlíček</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lozhkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">V</forename><surname>Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2406.17557.arXiv:2406.17557" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Cosgrove</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Re</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Acosta-Navas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Hudson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zelikman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ladhak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Santhanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Orr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yuksekgonul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suzgun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">S</forename><surname>Chatterji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hashimoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Icard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Koreeda</surname></persName>
		</author>
		<ptr target="Ex-pertCertification" />
	</analytic>
	<monogr>
		<title level="m">Holistic evaluation of language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Lozhkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename><surname>Allal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cassano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lamy-Poirier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pykhtar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kocetkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zucker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belkada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Abulkhanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Paul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Risdal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">Y</forename><surname>Zhuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zheltonozhskii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">O O</forename><surname>Dade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Krauß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Abati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oblokulov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Akiki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zebaze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Dehaene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Patry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mcauley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scholak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Paquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chapados</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Patwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tajbakhsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Ferrandis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hughes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Vries</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.19173</idno>
		<title level="m">Starcoder 2 and the stack v2: The next generation</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">What&apos;s in my big data?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Elazar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bhagia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Magnusson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ravichander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Suhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Walsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Groeneveld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Soldaini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dodge</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=RvfPnOkPV4" />
	</analytic>
	<monogr>
		<title level="m">The Twelfth International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">web data, and web data only</title>
		<author>
			<persName><forename type="first">G</forename><surname>Penedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Malartic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hesslow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cojocaru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cappelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Alobeidli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pannier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Almazrouei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Launay</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2306.01116" />
	</analytic>
	<monogr>
		<title level="m">The refinedweb dataset for falcon llm: Outperforming curated corpora with</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ainslie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee-Thorp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Jong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zemlyanskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lebrón</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sanghai</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.13245</idno>
		<title level="m">Gqa: Training generalized multi-query transformer models from multihead checkpoints</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.10509</idno>
		<title level="m">Generating long sequences with sparse transformers</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.05150</idno>
		<title level="m">Longformer: The long-document transformer</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">llama.cpp: Inference of meta&apos;s llama model (and others) in pure c/c++</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gerganov</surname></persName>
		</author>
		<ptr target="https://github.com/ggerganov/llama.cpp" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">FlashAttention-2: Faster attention with better parallelism and work partitioning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Efficient memory management for large language model serving with pagedattention</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kwon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles</title>
				<meeting>the ACM SIGOPS 29th Symposium on Operating Systems Principles</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.05101</idno>
		<title level="m">Decoupled weight decay regularization</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Moroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Conia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<title level="m">ITA-Bench: Towards a more comprehensive evaluation for Italian LLMs</title>
				<imprint>
			<publisher>CLiC-it</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cowhey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schoenick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tafjord</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.05457</idno>
		<title level="m">Think you have solved question answering? try arc, the ai2 reasoning challenge</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kwiatkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Collins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.10044</idno>
		<title level="m">Boolq: Exploring the surprising difficulty of natural yes/no questions</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Cobbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kosaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bavarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Plappert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tworek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nakano</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2110.14168</idno>
		<title level="m">Training verifiers to solve math word problems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">Hellaswag: Can a machine really finish your sentence?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bisk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.07830</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Measuring massive multitask language understanding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations</title>
				<meeting>the International Conference on Learning Representations</meeting>
		<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Piqa: Reasoning about physical commonsense in natural language</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bisk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI conference on artificial intelligence</title>
				<meeting>the AAAI conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="7432" to="7439" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Welbl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gardner</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.06209</idno>
		<title level="m">Crowdsourcing multiple choice science questions</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Evans</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.07958</idno>
		<title level="m">Truthfulqa: Measuring how models mimic human falsehoods</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Winogrande: An adversarial winograd schema challenge at scale</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sakaguchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="99" to="106" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m" type="main">Disce aut deficere: Evaluating llms proficiency on the INVALSI Italian benchmark</title>
		<author>
			<persName><forename type="first">F</forename><surname>Mercorio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mezzanzanica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Potertì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Serino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seveso</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2406.17535" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Fewshot learning with multilingual generative language models</title>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">V</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pasunuru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>O'horo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.emnlp-main.616</idno>
		<ptr target="https://aclanthology.org/2022.emnlp-main.616" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="9019" to="9052" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Sekhavat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Horton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mirzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najibi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Belenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zatloukal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rastegari</surname></persName>
		</author>
		<idno>arXiv.org</idno>
		<ptr target="https://arxiv.org/abs/2404.14619v1" />
		<title level="m">OpenELM: An Efficient Language Model Family with Open Training and Inference Framework</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Sarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.03759</idno>
		<title level="m">It5: Large-scale text-to-text pretraining for italian language understanding and generation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Two new datasets for italian-language abstractive text summarization</title>
		<author>
			<persName><forename type="first">N</forename><surname>Landro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>La Grassa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Federici</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page">228</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M.-C</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.04052</idno>
		<title level="m">Decoder-only or encoder-decoder? interpreting language model as a regularized encoderdecoder</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">The unreasonable effectiveness of few-shot learning for machine translation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cherry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Foster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="10867" to="10878" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">The flores-101 evaluation benchmark for low-resource and multilingual machine translation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Krishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="522" to="538" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Titov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sennrich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.11867</idno>
		<title level="m">Improving massively multilingual neural machine translation and zero-shot translation</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Biases in large language models: Origins, inventory, and discussion</title>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Conia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ross</surname></persName>
		</author>
		<idno type="DOI">10.1145/3597307</idno>
		<ptr target="https://doi.org/10.1145/3597307" />
	</analytic>
	<monogr>
		<title level="j">J. Data and Information Quality</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1" to="21" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Increasing coverage and precision of textual information in multilingual knowledge graphs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Conia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Minhas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ilyas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.emnlp-main.100</idno>
		<ptr target="https://aclanthology.org/2023.emnlp-main.100" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
		<title level="s">Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1612" to="1634" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Towards cross-cultural machine translation with retrieval-augmented generation from multilingual knowledge graphs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Conia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><forename type="middle">F</forename><surname>Minhas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Potdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2410.14057" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2024 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Miami, Florida, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
