<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>What we Learned from Continually Training Minerva: a Case Study on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Bonomo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Giofré</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lu Xu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Fedele</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Colosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrei Stefan Bejgu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Scirè</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Babelscape</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza NLP Group, Dip. di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3</volume>
      <issue>16</issue>
      <abstract>
        <p>Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised finetuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully open-source LLM pretrained on a corpus composed of 50% Italian, we define and evaluate three distinct data recipes, comprising mathematical, encyclopedic, and copyrighted content, spanning both Italian and English. We also investigate the effect of extending the model's context window during continual pretraining on its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less represented languages within an open scientific framework.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Italian</kwd>
        <kwd>Continual Pre-training</kwd>
        <kwd>Culturality</kwd>
        <kwd>Long Context</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>answering or summarization) or, more frequently, aim at
training general-purpose conversational models. This is
Modern Large Language Models (LLMs) are typically achieved by finetuning LLMs on hundreds of thousands
trained through a multi-stage process comprising pre- of conversations covering diverse domains. Through this
training, supervised fine-tuning (SFT), and preference process, models learn to follow instructions to perform
alignment. During pretraining, models are trained in an a wide range of tasks [7, 8, 9] and generate coherent
autoregressive manner to learn language in an unsuper- responses in dialogue-like interactions.
vised way, without requiring human-labeled data [1, 2]. While the overall LLM training pipeline has become
This phase allows models to acquire linguistic knowl- increasingly standardized, the role of curated data after
edge from large-scale, unstructured corpora. Recent ap- initial pretraining remains an active area of investigation
proaches [3, 4, 5, 6] structure the pretraining process into for further improving model capabilities. However, the
two steps. In the first, models are exposed to trillions of efects of continual training on curated data mixtures
reraw web-sourced tokens, with only a small portion of main poorly understood, particularly for less represented
high-quality content. In the second, training continues on languages such as Italian. To the best of our knowledge,
a curated set of high-quality language or domain-specific OLMo et al. [3] is the only work specifically addressing
texts, aiming to mitigate the impact of low-quality web the impact of data composition in an open-source setting;
content and extend the model’s exposure to up-to-date however, it is limited to the English language.
and informative content. In this work, we address this gap by systematically
in</p>
      <p>After the intensive pretraining phase—where LLMs are vestigating how incorporating high-quality data mixtures
trained solely on unlabeled data—models undergo super- during continual pretraining afects model performance
vised fine-tuning to adapt to real-world use cases. SFT on English- and Italian-language tasks. A particular
focan target either task-specific applications (e.g., question cus is placed on cultural knowledge evaluation, where
curated data is expected to play a crucial role in
enrichCLiC-it 2025: Eleventh Italian Conference on Computational Linguis- ing the model’s ability to answer questions about Italian
*tiCcso,rSreepstpeomnbdeirng24a—uth2o6,r.2025, Cagliari, Italy cultural content. To this end, we build on the Minerva-7B
$ moroni@diag.uniroma1.it (L. Moroni); base model [10], a fully open-source LLM pretrained on
bonomo@diag.uniroma1.it (T. Bonomo); giofre@diag.uniroma1.it a balanced corpus of Italian and English data (50% each),
(L. Giofré); xu@diag.uniroma1.it (L. Xu); fedele@babelscape.com which provides a suitable foundation for evaluating
bilin(D. Fedele); colosi@babelscape.com (L. Colosi); gual continual pretraining strategies.
(bAej.gSuc@irèb)a;bnealvscigalpi@e.cdoimag.(uAn.iSr.oBmeaj1g.uit);(Rsc.irNea@vibgalbi)elscape.com Specifically, we define three distinct high-quality data
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License recipes for continual pretraining, varying in data
dimenAttribution 4.0 International (CC BY 4.0).
sions and source types, using both Italian and English Context Length Manipulation. Large Language
texts. These include content rich in mathematical rea- Models are typically pretrained with a fixed maximum
soning, encyclopedic knowledge, and copyrighted books. context length, which limits the number of tokens they
Through ablation studies, we examine the individual con- can process in a single sequence. Recent work by Xiong
tribution of specific data sources—such as copyrighted et al. [12] demonstrates how expanding the context
material and mathematical content—on downstream per- length of Llama-2 models–from 4,096 to 32,728 tokens–
formance across English and Italian benchmarks. can improve performance on long-context tasks. A
crit</p>
      <p>Additionally, we explore the efect of extending the ical aspect of long-context training is the choice of
pomodel’s maximum context length during continual pre- sitional encoding. Most modern LLMs employ Rotary
training, aiming to assess its impact on long-context un- Positional Embeddings (RoPE) [13], which encode token
derstanding. After pretraining, we instruction-tune the positions by rotating the query and key vectors in
attenvarious model variants using a bilingual (English and tion layers. This approach maintains relative positional
Italian) instruction-following dataset to evaluate their information and can be adapted for longer sequences.
performance in conversational settings. Recent studies show that modifying the RoPE base
fre</p>
      <p>Finally, to properly evaluate the influence of longer quency during continual pretraining enables models to
context and data composition, we introduce INDAQA, a handle longer contexts and even extrapolate beyond the
novel Italian benchmark for narrative question answering trained sequence lengths [14, 15]. Building on these
find(Section 6.1). Using INDAQA, we demonstrate the bene- ings, several recent LLMs have been released with
exifts of longer context windows and specific high-quality tended context capabilities. For example, Grattafiori et al.
data sources for complex language understanding tasks. [4] increases the context length of Llama-3 models from
8,192 to 128,000 tokens in the final stages of pretraining.</p>
      <p>Similarly, the Qwen model family [16] mostly supports
2. Related Work contexts up to 32,000 tokens. However, despite these
advancements, to the best of our knowledge, this paper
is the first that systematically investigates the impact of
context length manipulation on Italian-language tasks.</p>
      <sec id="sec-1-1">
        <title>Continual Training</title>
        <p>Following the initial pretraining phase over trillions of tokens, it is now common practice
to introduce high-quality data in a subsequent training
stage to further enhance LLM performance and steer the
model’s distribution toward more controlled domains.
Recent research has increasingly focused on continual
pretraining as a practical and impactful approach. For
instance, OLMo et al. [3] and Grattafiori et al. [4]
introduce a mid-training stage that incorporates high-quality
datasets into the pretraining process, e.g. GSM8K
training set for mathematical reasoning. This stage is treated
as a continuation of the initial training, employing an
annealing learning rate that decays linearly to zero. This
approach has been shown to improve downstream
performance in tasks requiring structured reasoning and
encyclopedic knowledge recall.</p>
        <p>Continual training is also frequently employed to adapt
released open-weight LLMs to specific languages or
domains, thereby improving performance on targeted tasks.
Basile et al. [11] and others demonstrate that adapting
pretrained multilingual models to Italian using curated
high-quality data leads to significant improvements in
Italian-language benchmarks. Despite these advances,
there is still a lack of systematic studies that ablate and
isolate the specific contributions of different data mixing
strategies in the continual pretraining stage—particularly
for less represented languages like Italian. In our work,
we assess the impact of controlled data used in the
continual-pretraining stage, looking at their impact on
English and Italian performance.</p>
        <sec id="sec-1-1-1">
          <title>Evaluation of LLMs in Italian</title>
          <p>Several recent efforts aim to close the evaluation gap between English and
Italian for generative LLMs. One of the first initiatives,
Ita-Bench [17], combines translated benchmarks with
natively authored Italian tasks, focusing on instruction-following and question answering. Along the same lines,
Magnini et al. [18] reframes native Italian resources into
both multiple-choice and open-ended formats,
studying the role of prompting strategies. More recently,
ITALIC [19] introduces a multiple-choice question
answering dataset entirely written in Italian, covering
linguistic, cultural, and domain-specific knowledge. In
parallel, Puccetti et al. [20] adapts Invalsi assessments to
probe LLMs’ multi-domain abilities.</p>
          <p>Complementing these Italian-specific efforts, multilingual benchmarks have also emerged.
GlobalMMLU [21] extends MMLU to multiple languages via
professional translation and cultural adaptation, while
MultiLOKO [22] provides culturally grounded questions
authored directly in each target language, including
Italian. While these benchmarks cover a variety of linguistic
and cultural aspects, they primarily focus on short-form
tasks. Yet, many real-world scenarios, such as
narrative comprehension and document-level reasoning,
require models to process and integrate information across
longer contexts. However, evaluation resources in Italian
remain limited in this dimension. To fill this gap, we
introduce INDAQA (Section 6.1), the first narrative question
answering benchmark designed to evaluate long-context
comprehension in Italian.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>This work investigates the impact of continual training</title>
        <p>and the influence of diferent data sources on downstream
performance, with particular attention to copyrighted
material. Additionally, we aim to address a gap in the
literature regarding the efect of context length expansion
on performance in Italian.</p>
      <p>We focus on three key dimensions:
• Data recipes: we introduce three distinct recipes designed to evaluate the role of data composition during continual training.
• Context length: we describe how we adapt models to long-context scenarios, using a selected data mixture from the previous step.
• Instruction following: we examine the instruction-following capabilities developed on top of each training recipe.</p>
      <p>Table 1: data composition of Recipe-1, drawing on Benchmarks, Wikipedia, RedPajama, Fineweb-edu, Gutenberg, FLAN, and The Stack.</p>
      <sec id="sec-2-1">
        <title>3.1. Data Recipes for Wide Linguistic Coverage</title>
        <p>To evaluate the impact of various data sources on the continual training of an open-source LLM, namely the Minerva-7B base model, we define several data recipes, each representing a distinct mixture of training corpora. Table 1 presents the data composition for one such configuration, which we refer to as Recipe-1 (Recipe-1 corresponds to the continual pretraining data used in the first version of the released Minerva-7B). This recipe incorporates a diverse set of sources. For Italian, we include: the Italian Wikipedia (Hugging Face version, 2023 dump, Italian split; https://huggingface.co/datasets/wikimedia/wikipedia), an encyclopedic collection of text, RedPajama [23], a web-based collection, and Ita-Bench [17], a suite of Italian and English benchmarks for generative models (Italian training split). Regarding English, the dataset comprises: Wikipedia (English split), Ita-Bench (English training split), Fineweb-edu [24], a web-based collection, Project Gutenberg (https://huggingface.co/datasets/manu/project_gutenberg), which comprises public-domain books, and FLAN [25, 26, 27, 28, 29], which contains different instructions for mathematical and logical reasoning.</p>
        <p>Building on Recipe-1, we design two additional data mixtures, Recipe-2 and Recipe-3, to evaluate the impact of mathematical reasoning data and the inclusion of a large volume of copyrighted books. Table 2 shows the data composition for these two recipes. Starting from the foundation of Recipe-1, we replace the standard Wikipedia dump with a curated and cleaned version collected by us, updated to May 2024. We also expanded the dataset with additional sources. For Italian, we included the Wikisource collection of articles (https://huggingface.co/datasets/wikimedia/wikisource), Gazzetta Ufficiale (https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale), which contains legislative and administrative acts of the Italian State, and Project Gutenberg. For English, we incorporated subsets of the Dolmino-mix dataset, used in the continual training of OLMo-2 [3], specifically the MATH and StackExchange (SE) components. The key distinction between Recipe-2 and Recipe-3 is that Recipe-3 incorporates the Books3 dataset [30], which allows the impact of including closed-copyrighted book content to be quantified. Further details on our data preprocessing steps can be found in Appendix B.</p>
      </sec>
      <sec id="sec-2-1b">
        <title>3.2. Long-context Adaptation</title>
        <p>Recent studies demonstrate that continual pre-training can substantially extend the context length of LLMs [12, 31]. Based on previous work, and motivated by the lack of a proper assessment of context expansion in Italian, we carry out the context length expansion on Recipe-3, our continually pre-trained model described in Section 3.1. Following the methodology of Xiong et al. [12], we extend the maximum context length from 4,096 tokens (the original limit of Minerva-7B) to 16,384 tokens. This expansion requires adjusting the Rotary Position Embedding (RoPE) base frequency θ from 10,000 to 500,000 to accommodate the increased sequence length. To establish baseline comparisons, we adjust the RoPE base frequency in our continually-trained models obtained through the recipes of Section 3.1 in order to adapt them to longer contexts.</p>
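        <p>For illustration, the sketch below expresses such a RoPE base-frequency change with the Hugging Face transformers API for a Llama/Mistral-style checkpoint; the checkpoint identifier and field names are assumptions for this example, not the authors' released configuration.</p>
        <preformat>
# Minimal sketch of the context-extension setup described above (assumed
# checkpoint id; any Llama/Mistral-style config exposing rope_theta works).
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "sapienzanlp/Minerva-7B-base-v1.0"  # assumed identifier

config = AutoConfig.from_pretrained(checkpoint)
config.rope_theta = 500_000              # RoPE base frequency: 10,000 to 500,000
config.max_position_embeddings = 16_384  # context window: 4,096 to 16,384 tokens

model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config)
# Continual pre-training on 16,384-token sequences then proceeds as usual; for
# the training-free baselines, only rope_theta is raised (e.g. to 100,000) at
# inference time, without any further training.
        </preformat>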
      </sec>
      <sec id="sec-2-2">
        <title>1Recipe-1 corresponds to the continual pretraining data used in the</title>
        <p>ifrst version of the released Minerva-7B.
2https://huggingface.co/datasets/wikimedia/wikipedia
3https://huggingface.co/datasets/manu/project_gutenberg</p>
      </sec>
      <sec id="sec-2-3">
        <title>4https://huggingface.co/datasets/wikimedia/wikisource 5https://huggingface.co/datasets/mii-llm/gazzetta-uficiale</title>
        <sec id="sec-2-3-1">
          <title>Dataset</title>
          <p>TÜLU-v3
LIMA
WildChat-IT
TowerBlocks-v0.2
GPT-4o-ITA-Instruct
Aya</p>
          <p>EN
IT/EN</p>
          <p>IT
IT/EN</p>
          <p>IT
IT</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental setup</title>
      <p>4.1. Continual training</p>
      <sec id="sec-3-1">
        <title>4.2. Instruction finetuning</title>
        <p>After continual pre-training, each recipe is converted into an instruct model through an SFT stage on the dialogue mixture summarised in Table 3. We base the mixture on TÜLU-v3 [9], a popular open-source 940K-conversation corpus covering 85 task families (reasoning, code, function calling, safety, tool use, etc.), mined from public APIs and manually filtered for policy compliance, which provides the broad, structured competence expected of modern assistants. To inject high-signal, stylistically polished examples, we add the 1,000-turn LIMA dataset [8] and its Italian counterpart LIMA-IT, produced by us by translating every prompt/response pair with GPT-4o-mini under a fidelity-preserving prompt; this gives the model a high-quality set of concise, helpful dialogues in both languages. We expand our selection with additional Italian-centric datasets: i) WildChat-IT, consisting of 5K informal prompts; ii) TowerBlocks-v0.2, containing 7K bilingual Italian-English public-service Q&amp;A pairs; iii) GPT-4o-ITA-Instruct, with 15K high-quality synthetic chain-of-thought examples; and iv) Aya, which includes 700 role-play and reasoning turns, specifically targeting colloquial language, public administration knowledge, and culturally grounded reasoning.</p>
        <p>Supervised fine-tuning was carried out with the LLAMA-Factory toolkit (https://github.com/hiyouga/Llama-Factory), which supports several conversation templates and provides utilities for efficient data parallelization. We fine-tuned the full Minerva-7B weights (no LoRA/adapters) in bfloat16 mixed precision. Training lasted two epochs with a peak learning rate of 1 × 10⁻⁶ scheduled by cosine decay after a 10% warm-up, and AdamW as the optimizer. We used an effective batch of 64 sequences (≈ 128 tokens). All models were trained with a 4,096-token context window, except the long-context variant of Recipe-3, which retained its 16,384-token window. End-to-end, each recipe consumed about 210 GPU-hours (240 for the long-context run). Detailed timing and CO2 estimates are shown in Appendix A.</p>
        <p>6 https://github.com/mosaicml/llm-foundry
7 https://www.hpc.cineca.it/systems/hardware/leonardo/</p>
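        <p>As a rough illustration of these hyperparameters, the following sketch restates them with Hugging Face TrainingArguments; the authors used the LLAMA-Factory toolkit, so this is an equivalent re-statement under assumed argument names, not their actual configuration file.</p>
        <preformat>
# Hedged sketch of the reported SFT hyperparameters (illustrative only).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="minerva-7b-sft",
    num_train_epochs=2,
    learning_rate=1e-6,              # peak learning rate
    lr_scheduler_type="cosine",      # cosine decay
    warmup_ratio=0.1,                # 10% warm-up
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # bfloat16 mixed precision
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,  # effective batch of 64 sequences (assumed split)
)
        </preformat>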
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Evaluation</title>
      <p>5.1. Language Modeling by Genre</p>
      <p>To evaluate the impact of the different data recipes, we analyze the perplexity scores of the trained LLMs on held-out data from various genres. Specifically, we test the models on three distinct genres: Books, Wikipedia, and News. The Books set consists of 51 held-out books selected from Books3 [30], covering 25 different genres, in English. The Wikipedia set includes 50 Italian pages from a 2025 snapshot (we process the May 1st, 2025 Wikipedia dump by first discarding pages with fewer than 500 tokens and then sampling uniformly at random from the resulting set), excluded from the training data used in all recipes. The News set consists of 200 Italian newspaper articles we independently collected from 2025 publications, ensuring they were never seen during any training step. Table 4 reports the language modeling performance, measured by perplexity, across these domains for each trained model.</p>
      <p>Regarding Books, incorporating Books3 into the training mix significantly lowers perplexity, as seen in the improved performance of Recipe-3. This indicates that including in-domain book content enhances generalization to literary-style text. Additionally, testing Recipe-3-16K with 16k context on Books drops the perplexity to 8.98, further improving modeling on extended sequences. For the Wikipedia genre, all three recipes outperform the original pretrained model, demonstrating improved ability to model high-quality encyclopedic text. Notably, Recipe-2 and Recipe-3 achieve the lowest perplexity, suggesting benefits from training on more recent and cleaner Wikipedia texts. In contrast, for the News genre, perplexity differences among the recipes are minimal (±0.20), indicating a limited impact of the training data variations on this domain. Interestingly, the base model achieves the lowest perplexity.</p>
      <p>Bottom line: The modeling of literary-style texts and Wikipedia articles is influenced by the choice of continual pretraining strategies, whereas News articles show no differences.</p>
      <p>5.2. Multi-Choice Question Answering</p>
      <p>To properly assess how different continual pretraining recipes influence LLM capabilities, we evaluate our trained models on a range of Italian-language benchmarks. In this Section, we focus exclusively on the continually-trained models, before applying any instruction tuning. This approach isolates the effects of continual pretraining and avoids biases introduced by SFT data. We conduct evaluations using the LM-Evaluation-Harness [33] library, leveraging the multi-choice format: a model's next-token prediction is used to assess its QA ability.</p>
      <p>We evaluate the models using ITA-Bench [17], selecting a diverse set of tasks from the benchmark: AMI (Misogyny Detection), GhigliottinAI (GH; a culturally grounded game), NERMUD (Named Entity Recognition), Prelearn (PL; Prerequisite Learning), ARC (Scientific Reasoning), BoolQ (BQ; Boolean Questions), GSM8K (Mathematics), HellaSwag (HS; Textual Entailment), MMLU (Multi-domain QA), PIQA (Physical Interaction QA), and SCIQ (Science Questions). For AMI, GhigliottinAI, and NERMUD, we use ITA-Bench's cloze-style evaluation format.</p>
      <p>Table 5 shows that all continual pretraining recipes consistently improve over the pretrained model, with an average gain of approximately +5.0 points. This result reinforces the importance of continual pretraining on high-quality (e.g., Wikipedia, Fineweb-edu) and synthetic datasets (e.g., FLAN, the Dolmino-MATH subset). Notably, MMLU exhibits substantial improvements across all recipes (≈ +15 points), highlighting strong generalization on multi-domain QA tasks. The best average performance is achieved by Recipe-2 and the long-context variant of Recipe-3. Recipe-1 underperforms, particularly on math-related benchmarks such as ARC and GSM8K, indicating the critical role of domain-specific data (e.g., Dolmino-MATH) in boosting model capabilities.</p>
      <p>Bottom line: Continual pretraining consistently boosts downstream performance; mathematical data improves STEM QA, while copyrighted books have minimal impact.</p>
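      <p>The sketch below shows how such a multiple-choice evaluation can be launched through the LM-Evaluation-Harness Python API (version 0.4+); the task names listed here are standard harness tasks used purely as an illustration, not the exact ITA-Bench task identifiers.</p>
      <preformat>
# Hedged sketch of a log-likelihood-based multiple-choice evaluation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sapienzanlp/Minerva-7B-base-v1.0,dtype=bfloat16",
    tasks=["mmlu", "arc_challenge", "hellaswag"],  # illustrative task names
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
      </preformat>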
      <sec id="sec-4-1">
        <title>9We process the May 1st, 2025 Wikipedia dump by first discarding</title>
        <p>pages with fewer than 500 tokens, and then sampling uniformly at
random from the resulting set.</p>
        <p>Bottom line: Continual pretraining consistently boosts
downstream performance; mathematical data improves</p>
        <p>STEM QA, while copyrighted books have minimal impact.
5.3. Mathematical Evaluation
To assess the impact of diferent continual-pretraining Recipe-1 2.48 14.70
recipes on math capabilities, we rely on two widely Recipe-2 9.57 34.42
used English mathematical benchmarks: GSM8k [34] and Recipe-3 8.96 26.45
MATH [35]. The former contains grade school math word Recipe-316K 10.26 32.29
problems, while the latter comprises challenging compe- Minerva Instruct Models
tition mathematics problems. We evaluate our models Recipe-1 10.14 24.63
using the LM-Evaluation-Harness [33], using its im- Recipe-2 12.84 42.45
plementations of both benchmarks. For GSM8k, we adopt Recipe-3 13.00 37.98
an 8-shot Chain-of-Thought prompting setup, while for Recipe-316K 12.82 40.25
MATH, we follow the Minerva-MATH [36] protocol, us- Italian-specific Models
ing 4-shot Chain-of-Thought prompting. Both
benchmarks use the generate_until setup, with model out- AONccIiTgAlo-8t-B7b 1170..5866 6409..6858
puts evaluated via post-processing for accuracy. We
compare our recipes to diferent open-source Italian (occiglot- English-first Models
7b-it-en-instruct10, ANITA-8B [37]) and multilingual Llama-3.1-8B 41.94 80.66
(Llama-3.1-8B [4], Mistral-7B [38], Qwen3-8B [39]) mod- Mistral-7B-v0.3 13.92 53.22
els, all in the same parameter range. Qwen3-8B 65.00 87.86</p>
        <p>Table 6 presents the results of tested models, with our Table 6
four continually pre-trained Minerva models evaluated Mathematical evaluation results on diferent Minerva
conboth before and after instruction tuning. On GSM8k, tinual pre-training recipes (before and after instruction
fineRecipe-2 achieves the highest accuracy in both settings, tuning) and State-of-the-Art models on Minerva-MATH
(4followed by Recipe-3, while Recipe-1 consistently un- shot) with sub-categories, and GSM8k (8-shot).
derperforms. Instruction tuning yields consistent
improvements across all recipes, reinforcing the overall
ranking and demonstrating its positive efect. These Bottom line: Continual pretraining on mathematical
ifndings suggest that incorporating mathematical data, data consistently improves accuracy on math problems.
such as Dolmino-MATH, during continual pre-training Instruction tuning on TULU-v3 helps mitigate the
shortplays a significant role in enhancing mathematical rea- comings of Recipe-1 on the MATH benchmark.
soning. For the MATH dataset, Recipes 2 and 3
outperform Recipe-1 in the base (pre-instruction tuning) setting, 5.4. Cultural Evaluation
particularly benefiting from long-context capabilities.
Interestingly, after instruction tuning, the performance gap We assess the impact of our recipes used during continual
narrows, with Recipe-1 becoming more competitive. pre-training by leveraging the Italian part of the
Multi</p>
        <p>When comparing Minerva models to state-of-the-art loko [22] dataset (250 instances), which provides
quessystems on GSM8k, they lag behind closed-data models tions on cultural content along with multiple acceptable
in both Italian and English. On the MATH dataset, Min- answers. We then compare our continually pre-trained
erva is comparable to Occiglot and Mistral, two closed- and instruction finetuned Minerva models to other Italian
data models, but still lags behind top-performing English- and English models, as in the previous section.
centric systems. This highlights the perfomance gap that According to the results in Table 7, Recipe-1 is the
Italian open-data LLMs must bridge. best performing model, both in Zero- and Few-Shot
settings, surpassing both the Italian-specific and the
English10https://huggingface.co/occiglot/occiglot-7b-it-en-instruct centric counterparts.</p>
      </sec>
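      <p>For reference, METEOR can be computed against multiple references with NLTK as sketched below; this illustrates the metric used in the open-ended evaluation, not the authors' exact scoring pipeline.</p>
      <preformat>
# Minimal METEOR example with one hypothesis and two tokenized references.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

references = [
    "Il protagonista torna nel suo paese natale.".split(),
    "Ritorna al paese dove era nato.".split(),
]
hypothesis = "Il protagonista ritorna nel paese in cui era nato.".split()

print(meteor_score(references, hypothesis))
      </preformat>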
    </sec>
    <sec id="sec-5">
      <title>6. Long-context Evaluation on Narrative Text</title>
      <p>To evaluate the long-context capabilities of our model,
we focus on narrative question answering, a task that
requires the processing and understanding of
extensive narrative text in order to answer questions.
NarrativeQA [41], a widespread benchmark for this task, was
constructed in English, which limits its use for the
evaluation of long-context performance in other languages.
To address this limitation, we introduce INDAQA
(Section 6.1), a novel benchmark for Italian narrative question
answering, and, to the best of our knowledge, the first
narrative question answering dataset in Italian. We
describe the evaluation setup for base and instruction-tuned
models on both NarrativeQA and INDAQA in Section 6.2
and report the results in Section 6.3.
6.1. INDAQA - Italian Narrative DAtaset for Question Answering</p>
      <sec id="sec-5-1">
        <title>6.1. INDAQA - Italian Narrative DAtaset for Question Answering</title>
        <p>We start building the dataset from the Italian split of Echoes from Alexandria [42], collecting 365 (book, summary) pairs with full texts from Wikisource and summaries from Wikipedia. After manually verifying alignment and removing plot-unrelated content from the summaries, we prompt an LLM to generate 20 question-answer pairs per book using the following guidelines: (i) questions must be unique, (ii) questions must be clear, unambiguous, and answerable from the summary alone, and (iii) each question requires having two short, potentially different, reference answers.</p>
        <p>After gathering a large number of samples, we filter them through three sequential steps. First, we deduplicate questions, but rather than discarding duplicates entirely, we retain all unique answers as additional references for the remaining samples. We also preserve different reformulations of identical questions, as NarrativeQA contains similar variations. Second, we remove unanswerable questions, i.e., samples containing invalid responses such as "Information not present in the summary." Finally, we filter out meta-questions that focus on structural rather than plot elements (e.g., "What happens in chapter 3?" or "What is the title of the book?"). The last two filtering steps are carried out through a set of manually derived RegEx patterns. Examples of samples that were filtered out are showcased in Table 11 (Appendix).</p>
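        <p>The pattern lists themselves are not reproduced in the paper, so the sketch below only illustrates the kind of manually derived RegEx filters described above; the patterns and the helper function are hypothetical examples.</p>
        <preformat>
# Hypothetical filters for unanswerable responses and meta-questions.
import re

UNANSWERABLE = re.compile(r"informazion[ei] non (è |sono )?present[ei]",
                          re.IGNORECASE)
META_QUESTION = re.compile(r"(capitolo \d+|titolo del libro|quanti capitoli)",
                           re.IGNORECASE)

def keep_sample(question, answers):
    if any(UNANSWERABLE.search(a) for a in answers):
        return False   # invalid "information not present in the summary" answers
    if META_QUESTION.search(question):
        return False   # structural rather than plot-related questions
    return True
        </preformat>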
        <p>11We use Gemini-2.0-Flash and Gemini-2.0-Flash-Lite.</p>
        <sec id="sec-5-1-1">
          <title>Metric</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Avg. Length (Tokens) # Samples</title>
          <p>Question
1st Answer
2nd Answer
Question
1st Answer
2nd Answer
3rd Answer
4th Answer
5th Answer
ifltering steps on 17 documents (646 QA samples, 5% of
the dataset) spanning diverse summary lengths (18-1200
tokens). Each sample is annotated for acceptability using
the same criteria used for generation, yielding a 2.32%
error rate after filtering.</p>
          <p>Our final dataset, INDAQA, consists of texts with an
average length shorter than NarrativeQA (27k vs 47k
tokens) due to the prevalence of short stories and theatrical
plays.12 The size of the two datasets is comparable (365 vs
355 documents) with slightly more average QA samples
in INDAQA (37.83 vs 29.74). We also report the type of
questions in the dataset by analyzing the first few tokens
of the questions in Table 10 (Appendix). More details can
be found in Appendix C.</p>
        <p>6.2. Long-context Evaluation Setup</p>
        <p>Base-model evaluation. To evaluate the effectiveness of our long-context continual training approach, we compare Recipe-3-16K against Recipe-1, Recipe-2 and Recipe-3. Except for Recipe-3-16K, we adapt each model to process longer sequences by tuning the RoPE base frequency to θ = 100,000. We assess each model's ability to utilize extended context using an adapted version of NarrativeQA and INDAQA. Specifically, we truncate each text at varying target context lengths (4,096, 8,192, 16,384 and 32,768 tokens), and we record the minimum perplexity achieved by each model across the ground-truth answers when given the truncated text and the respective questions. We assume that models effectively processing long contexts will show lower perplexity on correct answers than those struggling with extended documents.</p>
        <p>Instruction-tuning evaluation. We evaluate the instruction-tuned versions of the Minerva continually pretrained models alongside various systems, as in the previous sections. Benchmarking is conducted on both NarrativeQA and INDAQA to assess real-world performance in English and Italian narrative question answering. We report METEOR [40] scores to measure answer quality against the reference responses. We truncate the book texts to 16,384 and 32,768 tokens to match our target context lengths, following the approach used in LongBench [43]. While some questions may require context that is excluded by this truncation, all models are affected equally, ensuring a fair comparison between them.</p>
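        <p>A sketch of the base-model probe described above is shown below: the perplexity of a gold answer is computed given the truncated book and its question, and the minimum over the reference answers is recorded; the prompt template and names are placeholders rather than the authors' exact formatting.</p>
        <preformat>
# Perplexity of a reference answer conditioned on the truncated book (sketch).
import math
import torch

def answer_perplexity(model, tokenizer, book, question, answer, context_len=16_384):
    book_ids = tokenizer(book, truncation=True, max_length=context_len).input_ids
    prompt = tokenizer.decode(book_ids) + "\n\nDomanda: " + question + "\nRisposta: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt",
                           add_special_tokens=False).input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    labels = ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # score only the answer tokens
    with torch.no_grad():
        loss = model(ids, labels=labels).loss
    return math.exp(loss.item())

# Reported value: min(answer_perplexity(m, t, book, q, a) for a in references)
        </preformat>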
        <p>6.3. Results</p>
        <p>In Figure 1 we present the results of our base-model evaluation. Our long-context adaptation of Recipe-3 clearly enables the model to achieve a lower perplexity on the answers of NarrativeQA and INDAQA at all context lengths tested, indicating an effective adaptation to long data. It is especially interesting to note the results at 32,768 tokens: adapting models continually trained with shorter context lengths through RoPE frequency tuning is not enough to avoid huge spikes in perplexity, while Recipe-3-16K is able to effectively model text at double its continual-training context window.</p>
        <p>12 In our experiments, the input text is always truncated at 16k tokens.</p>
        <p>Table 9 presents the results of the evaluation of our instruction-tuned models. As expected, Recipe-3-16K achieves higher results in all settings, surpassing Recipe-1 in all experiments with books truncated to 16k tokens by 7.7 points on NarrativeQA and 8.6 on INDAQA. The difference is even larger when we extend the truncation of books to 32K tokens, with Recipe-3-16K achieving 17.3 and 14.9 more METEOR points on NarrativeQA and INDAQA, respectively.</p>
        <p>Minerva models perform comparably to other models of the same size, both Italian-specific (occiglot-7b-it-en-instruct, ANITA-8B [37]) and multilingual (Llama-3.1-8B [4], Mistral-7B [38]). On NarrativeQA, the Recipe-3-16K variant achieves a METEOR score of 21.4 and 20.5 at a context length of 16K and 32K respectively, ranking behind Llama-3.1 and Mistral-v0.3. In contrast, the Minerva model continually pre-trained with Recipe-3-16K outperforms all tested models on INDAQA at 16K tokens of context, achieving the highest METEOR score of 25.9. At 32K tokens of context, it ranks second only to Llama-3.1 and Mistral-v0.3, scoring 3.3 and 1.7 points lower respectively on the METEOR metric. This performance gap is expected, given that Recipe-3-16K's continual training was conducted at half the context length (16K tokens).</p>
        <p>Table 9: Continual pre-training recipe evaluation on NarrativeQA and INDAQA after instruction fine-tuning. M@16k and M@32k denote METEOR scores with 16,384 and 32,768 token book contexts. Bold scores indicate best overall performance; underlined scores indicate the best Italian-specific model. INDAQA baseline rows (context window, M@16k / M@32k): occiglot-7b (32K) 19.9 / 19.9; ANITA-8B (8K) 7.5 / 7.0; Llama-3.1-8B (128K) 24.9 / 29.3; Mistral-7B-v0.3 (32K) 22.5 / 27.7.</p>
        <p>Bottom line: Extending context length to 16K tokens via continual pre-training improves modeling capabilities over training-free methods and enhances robustness at 32K tokens. Recipe-3-16K achieves strong narrative QA performance in both English and Italian, outperforming Italian-specific models and matching English-first LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <sec id="sec-6-1">
        <p>This work explores the impact of data mixing strategies and long-context expansion on Italian language modeling. We conduct continual pretraining using three distinct data recipes and apply a unified instruction-following fine-tuning approach to all resulting models. Our evaluation assesses language modeling capabilities on genre-specific data, highlighting that copyrighted books included in the training recipes reduce perplexity on literary texts. We benchmark the proposed continual pre-training recipes across several multi-domain tasks, with a focus on mathematical reasoning, demonstrating that genre-specific data, such as mathematical texts and high-quality web content, contribute to overall performance improvements, whereas copyrighted books do not consistently offer the same benefit. We also investigate cultural alignment, finding that English datasets, such as mathematical texts and English-copyrighted books, can negatively impact performance on culturally-aware Italian-specific tasks. Additionally, our ITALIC-GEN adaptation offers a complementary perspective on cultural evaluation, uncovering encouraging results for Italian LLMs. Lastly, we evaluate long-context capabilities through narrative question answering in both English and Italian. Due to the absence of an Italian benchmark, we introduced INDAQA, a new dataset for Italian narrative QA, and show that extending the context length of a model consistently improves its downstream performance on narrative QA.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgments</title>
        <p>Luca Moroni, Tommaso Bonomo, Luca Giofré and Lu Xu gratefully acknowledge the support of the AI
Factory IT4LIA project. Roberto Navigli acknowledges the
support of the PNRR MUR project PE0000013-FAIR. We
acknowledge ISCRA for awarding this project access to
the LEONARDO supercomputer, owned by the EuroHPC
Joint Undertaking, hosted by CINECA (Italy).
</p>
      </sec>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>[1] J. Hoffmann, S. Borgeaud, A. Mensch, et al., Training compute-optimal large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), Curran Associates Inc., Red Hook, NY, USA, 2022.</p>
      <p>[2] D. Groeneveld, I. Beltagy, E. Walsh, et al., OLMo: Accelerating the science of language models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 15789-15809. https://aclanthology.org/2024.acl-long.841/</p>
      <p>[3] T. OLMo, P. Walsh, L. Soldaini, et al., 2 OLMo 2 Furious, 2025. arXiv:2501.00656.</p>
      <p>[4] A. Grattafiori, A. Dubey, A. Jauhri, et al., The Llama 3 herd of models, 2024. arXiv:2407.21783.</p>
      <p>[5] Y. Xie, K. Aggarwal, A. Ahmad, Efficient continual pre-training for building domain specific large language models, in: Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024, pp. 10184-10201. https://aclanthology.org/2024.findings-acl.606/</p>
      <p>[6] L. Moroni, G. Puccetti, P.-L. Huguet Cabot, et al., Optimizing LLMs for Italian: Reducing token fertility and enhancing efficiency through vocabulary adaptation, in: Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, 2025, pp. 6646-6660. https://aclanthology.org/2025.findings-naacl.371/</p>
      <p>[7] N. Ding, Y. Chen, B. Xu, et al., Enhancing chat language models by scaling high-quality instructional conversations, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023, pp. 3029-3051. https://aclanthology.org/2023.emnlp-main.183/</p>
      <p>[8] C. Zhou, P. Liu, P. Xu, et al., LIMA: Less is more for alignment, 2023. arXiv:2305.11206.</p>
      <p>[9] N. Lambert, J. Morrison, V. Pyatkin, et al., Tülu 3: Pushing frontiers in open language model post-training, 2025. arXiv:2411.15124.</p>
      <p>[10] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 707-719. https://aclanthology.org/2024.clicit-1.77/</p>
      <p>[11] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.</p>
      <p>[12] W. Xiong, J. Liu, I. Molybog, et al., Effective long-context scaling of foundation models, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 2024, pp. 4643-4663. https://aclanthology.org/2024.naacl-long.260/</p>
      <p>[13] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu, RoFormer: Enhanced transformer with rotary position embedding, Neurocomputing 568 (2024) 127063. doi:10.1016/j.neucom.2023.127063.</p>
      <p>[14] X. Liu, H. Yan, C. An, X. Qiu, D. Lin, Scaling laws of RoPE-based extrapolation, in: The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024. https://openreview.net/forum?id=JO7k0SJ5V6</p>
      <p>[15] Y. Wu, Y. Gu, X. Feng, et al., Extending context window of large language models from a distributional perspective, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, 2024, pp. 7288-7301. https://aclanthology.org/2024.emnlp-main.414/</p>
      <p>[16] Qwen: A. Yang, B. Yang, B. Zhang, et al., Qwen2.5 technical report, 2025. arXiv:2412.15115.</p>
      <p>[17] L. Moroni, S. Conia, F. Martelli, R. Navigli, Towards a more comprehensive evaluation for Italian LLMs, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 584-599. https://aclanthology.org/2024.clicit-1.67/</p>
      <p>[18] B. Magnini, R. Zanoli, M. Resta, et al., Evalita-LLM: Benchmarking large language models on Italian, 2025. arXiv:2502.02289.</p>
      <p>[19] A. Seveso, D. Potertì, E. Federici, M. Mezzanzanica, F. Mercorio, ITALIC: An Italian culture-aware natural language benchmark, in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, New Mexico, 2025, pp. 1469-1478. https://aclanthology.org/2025.naacl-long.68/</p>
      <p>[20] G. Puccetti, M. Cassese, A. Esuli, The Invalsi benchmarks: measuring the linguistic and mathematical understanding of large language models in Italian, in: Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 6782-6797. https://aclanthology.org/2025.coling-main.453/</p>
      <p>[21] S. Singh, A. Romanou, C. Fourrier, et al., Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2025. arXiv:2412.03304.</p>
      <p>[22] D. Hupkes, N. Bogoychev, MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages, 2025. arXiv:2504.10356.</p>
      <p>[23] M. Weber, D. Y. Fu, Q. Anthony, et al., RedPajama: an open dataset for training large language models, NeurIPS Datasets and Benchmarks Track, 2024.</p>
      <p>[24] A. Lozhkov, L. Ben Allal, L. von Werra, T. Wolf, FineWeb-Edu: the finest collection of educational content, 2024. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu, doi:10.57967/hf/2497.</p>
      <p>[25] B. Goodson, Fine FLAN: SeqIO to parquet so you don't have to, https://huggingface.co/datasets/Open-Orca/FLAN, 2023.</p>
      <p>[26] S. Longpre, L. Hou, T. Vu, et al., The Flan collection: Designing data and methods for effective instruction tuning, 2023. arXiv:2301.13688.</p>
      <p>[27] J. Wei, M. Bosma, V. Y. Zhao, et al., Finetuned language models are zero-shot learners, 2022. arXiv:2109.01652.</p>
      <p>[28] V. Sanh, A. Webson, C. Raffel, et al., Multitask prompted training enables zero-shot task generalization, 2022. arXiv:2110.08207.</p>
      <p>[29] Y. Wang, S. Mishra, P. Alipoormolabashi, et al., Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks, 2022. arXiv:2204.07705.</p>
      <p>[30] L. Gao, S. Biderman, S. Black, et al., The Pile: An 800GB dataset of diverse text for language modeling, 2020. arXiv:2101.00027.</p>
      <p>[31] Qwen Team, Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens, 2025. https://qwenlm.github.io/blog/qwen2.5-1m/</p>
      <p>[32] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. arXiv:1711.05101.</p>
      <p>[33] L. Gao, J. Tow, B. Abbasi, et al., The Language Model Evaluation Harness, 2024. https://zenodo.org/records/12608602, doi:10.5281/zenodo.12608602.</p>
      <p>[34] K. Cobbe, V. Kosaraju, M. Bavarian, et al., Training verifiers to solve math word problems, 2021. arXiv:2110.14168.</p>
      <p>[35] D. Hendrycks, C. Burns, S. Kadavath, et al., Measuring mathematical problem solving with the MATH dataset, NeurIPS, 2021.</p>
      <p>[36] A. Lewkowycz, A. Andreassen, D. Dohan, et al., Solving quantitative reasoning problems with language models, 2022. arXiv:2206.14858.</p>
      <p>[37] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.</p>
      <p>[38] A. Q. Jiang, A. Sablayrolles, A. Mensch, et al., Mistral 7B, 2023. arXiv:2310.06825.</p>
      <p>[39] A. Yang, et al., Qwen3 technical report, 2025. arXiv:2505.09388.</p>
      <p>[40] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, 2005, pp. 65-72. https://aclanthology.org/W05-0909/</p>
      <p>[41] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, Transactions of the Association for Computational Linguistics 6 (2018) 317-328. https://aclanthology.org/Q18-1023/</p>
      <p>[42] A. Scirè, S. Conia, S. Ciciliano, R. Navigli, Echoes from Alexandria: A large resource for multilingual book summarization, in: Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 2023, pp. 853-867. https://aclanthology.org/2023.findings-acl.54/</p>
      <p>[43] Y. Bai, X. Lv, J. Zhang, et al., LongBench: A bilingual, multitask benchmark for long context understanding, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 3119-3137. https://aclanthology.org/2024.acl-long.172/</p>
    </sec>
    <sec id="sec-7">
      <title>A. Timing and CO2 Emissions Estimates</title>
      <p>To quantify both the computational effort and environmental footprint of our training and experiments, we compute energy and CO2 estimates assuming: an average GPU power draw of 300 W under full load, a data-center PUE (power usage effectiveness) of 1.2, and a grid emission factor of 0.28 kg CO2/kWh (typical for the European grid). The total energy consumed per GPU-hour is kWh/GPUh = 0.3 kW × 1.2 = 0.36 kWh/GPUh, and the CO2 emitted per GPU-hour is CO2/GPUh = 0.36 kWh × 0.28 kg/kWh ≈ 0.10 kg CO2/GPUh.</p>
      <p>We estimate that the continual training of the four recipes, Recipe-1 (3.5 days) and Recipes 2, 3, and 3-16K (7 days each), on 64 GPUs corresponds to a total GPU-time of ≈ 37,632 GPUh. Using an emission factor of 0.10 kg CO2/GPUh, this yields about 3.8 t CO2. With respect to the instruction tuning process, considering the same number of GPUs, the standard 4,096-token variant required approximately 3,000 GPU-hours, emitting roughly 3 t CO2, while the long-context 16,384-token variant ran for about double the time (6,000 GPU-hours), producing approximately 6 t CO2.</p>
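      <p>The continual-training estimate can be reproduced with the short calculation below, using the assumptions stated above.</p>
      <preformat>
# Back-of-the-envelope energy and CO2 estimate for continual training.
GPU_POWER_KW = 0.3            # average draw under full load
PUE = 1.2                     # data-center power usage effectiveness
GRID_KG_CO2_PER_KWH = 0.28    # typical European grid

kwh_per_gpuh = GPU_POWER_KW * PUE                      # 0.36 kWh/GPUh
kg_co2_per_gpuh = kwh_per_gpuh * GRID_KG_CO2_PER_KWH   # about 0.10 kg CO2/GPUh

gpu_hours = (3.5 + 3 * 7) * 24 * 64                    # four recipes on 64 GPUs
print(round(gpu_hours), round(gpu_hours * kg_co2_per_gpuh / 1000, 1))  # 37632, 3.8
      </preformat>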
    </sec>
    <sec id="sec-8">
      <title>Estimates</title>
      <sec id="sec-8-1">
        <title>RedPajama. We retrieved the RedPajama dataset</title>
        <p>from Hugging Face: https://huggingface.co/datasets/
togethercomputer/RedPajama-Data-V2. We performed
To quantify both the computational efort and environ- deduplication using the provided metadata and extracted
mental footprint of our training end experiments we com- the text from the ‘head‘ partition of each dump. For
pute energy and CO2 estimates assuming: Average GPU Recipe-1, we used the 2023-14 dump, while for Recipes 2
power draw: 300 W under full load. Data-center PUE and 3 we additionally used dumps 2023-06, 2022-49, and
(power usage efectiveness): 1.2. Grid emission factor: 2022-40. We filtered out texts with fewer than 500 words.
0.28 kg CO2/kWh (typical for the European grid). Gutenberg. We collected texts from Project Gutenberg
Total energy consumed per GPU-hour is via Hugging Face: https://huggingface.co/datasets/manu/
project_gutenberg.
kWh/GPUh = 0.3 kW × 1.2 = 0.36 kWh/GPUh, Fineweb-Edu. We used the Fineweb-Edu dataset from
Hugging Face: https://huggingface.co/datasets/
and CO2 emitted per GPU-hour is HuggingFaceFW/fineweb-edu, specifically the
sample-100BT branch. This is a random subset
CO2/GPUh = 0.36 kWh × 0.28 kWkgh aofmt hineimfuullmdaqtuasaelitt.yFsocroRreecoipfe3-.81;, fwoer Rseelceicpteesd2paagneds3w,withe
≈ 0.10 kg CO2/GPUh. applied a threshold of 4.0.</p>
        <p>Dolmino. The Dolmino data, specifically the math and
We estimate that the continual training of four recipes, stackexchange subsets, were obtained from: https://
Recipe 1 (3.5 days) and Recipes 2, 3, and 316k (7 days huggingface.co/datasets/allenai/dolmino-mix-1124.
each), on 64 GPUs corresponds to a total GPU-time of FLAN. We downloaded the FLAN dataset from https:
≈ 37 632 GPUh. //huggingface.co/datasets/allenai/dolma. We selected
Using an emission factor of 0.10 kg CO2/GPUh, this only the examples using the following prompt formats:
yields about 3.8 t CO2. fs_opt, fs_noopt, zs_opt, and zs_noopt.
With respect to the instruction tuning process, consider- The Stack. We collected data from the Stack
ing the same number of GPUs,the standard 4 096-token dataset at: https://huggingface.co/datasets/bigcode/
variant required approximately 3000 GPU-hours, emit- the-stack-v2-train-smol-ids. We included only
ting roughly 3 t CO2. The long-context 16 384-token code samples from the refs/heads/master and
variant ran for about double the time (6000 GPU-hours), refs/heads/main branches, and further filtered to
producing approximately 6 tons of CO2. include only repositories with at least 10 GitHub stars.
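      <p>The following short Python snippet is a minimal sketch that reproduces the back-of-the-envelope estimate above from the stated assumptions (300 W draw, PUE 1.2, 0.28 kg CO2/kWh); the variable names are ours and only illustrate the arithmetic.</p>
      <preformat>
POWER_KW = 0.3          # average GPU power draw under full load (300 W)
PUE = 1.2               # data-center power usage effectiveness
EMISSION_FACTOR = 0.28  # kg CO2 per kWh (typical European grid)

kwh_per_gpuh = POWER_KW * PUE                   # 0.36 kWh per GPU-hour
co2_per_gpuh = kwh_per_gpuh * EMISSION_FACTOR   # ~0.10 kg CO2 per GPU-hour

gpus = 64
days = 3.5 + 3 * 7                              # Recipe 1 plus Recipes 2, 3, 3-16k
total_gpuh = gpus * days * 24                   # 37,632 GPU-hours
total_co2_tonnes = total_gpuh * co2_per_gpuh / 1000
print(f"{total_gpuh:.0f} GPUh, about {total_co2_tonnes:.1f} t CO2")
</preformat>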
    </sec>
    <sec id="sec-8">
      <title>B. Data Processing</title>
      <p>This Section outlines the data processing steps applied to the various datasets used in the three main recipes described in Section 3.1.</p>
      <p>RedPajama. We retrieved the RedPajama dataset from Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2. We performed deduplication using the provided metadata and extracted the text from the ‘head’ partition of each dump. For Recipe-1, we used the 2023-14 dump, while for Recipes 2 and 3 we additionally used dumps 2023-06, 2022-49, and 2022-40. We filtered out texts with fewer than 500 words.</p>
      <p>Gutenberg. We collected texts from Project Gutenberg via Hugging Face: https://huggingface.co/datasets/manu/project_gutenberg.</p>
      <p>Fineweb-Edu. We used the Fineweb-Edu dataset from Hugging Face: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu, specifically the sample-100BT branch, which is a random subset of the full dataset. For Recipe-1, we selected pages with a minimum quality score of 3.8; for Recipes 2 and 3, we applied a threshold of 4.0.</p>
      <p>Dolmino. The Dolmino data, specifically the math and stackexchange subsets, were obtained from: https://huggingface.co/datasets/allenai/dolmino-mix-1124.</p>
      <p>FLAN. We downloaded the FLAN dataset from https://huggingface.co/datasets/allenai/dolma. We selected only the examples using the following prompt formats: fs_opt, fs_noopt, zs_opt, and zs_noopt.</p>
      <p>The Stack. We collected data from the Stack dataset at: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids. We included only code samples from the refs/heads/master and refs/heads/main branches, and further filtered to include only repositories with at least 10 GitHub stars.</p>
      <p>Books3. We used a previously obtained copy of the Books3 dataset, which is no longer publicly available for download.</p>
      <sec id="sec-8-2">
        <title>This Section outlines the data processing steps applied to the various datasets used in the three main recipes described in Section 3.1.</title>
      <p>Benchmarks. We utilized the translated benchmarks from ITA-Bench [17], specifically leveraging the training sets (when available) from both the original and translated versions. We formatted these using prompts defined consistently with LM-Evaluation-Harness [33].</p>
        <p>Wikisource. We downloaded the Hugging Face
version of the Wikisource dataset, available at: https://
huggingface.co/datasets/wikimedia/wikisource.
Gazzetta Ufficiale. We downloaded the Hugging Face version of the Gazzetta Ufficiale dataset, available at: https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale.</p>
        <p>Wikipedia. For Recipe-1, we used the Hugging Face
version of the Wikipedia dataset, available at: https://
huggingface.co/datasets/wikimedia/wikipedia. For Recipes 2 and 3, we used an updated version, collected and processed by us, with pages created up to 2024.</p>
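      <p>As an illustration of the length- and quality-based filtering applied to RedPajama and Fineweb-Edu above, the following minimal sketch uses the Hugging Face datasets library; the column names ("text", "score") and the exact loading arguments are assumptions for illustration rather than the pipeline actually used.</p>
      <preformat>
from datasets import load_dataset

def long_enough(example, min_words=500):
    # RedPajama-style length filter: keep documents with at least 500 words.
    return len(example["text"].split()) >= min_words

def educational(example, threshold=3.8):
    # Fineweb-Edu-style quality filter; "score" is an assumed column name.
    return example["score"] >= threshold

# Stream the sample-100BT subset rather than downloading it in full.
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-100BT",
                       split="train", streaming=True)

recipe1_pages = fineweb.filter(educational)                        # threshold 3.8
recipe23_pages = fineweb.filter(lambda ex: educational(ex, 4.0))   # threshold 4.0
</preformat>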
    </sec>
    <sec id="sec-9">
      <title>C. INDAQA</title>
      <p>In this Section, we present additional details on the
dataset we built, INDAQA. We retain samples asking
the same questions with different formulations,
following the approach in NarrativeQA. This design choice
preserves valuable linguistic variation that may prove
instrumental for future analyses examining the effects
of question reformulation on QA system performance.
While we maintain paraphrased questions, we eliminate
exact duplicates from the dataset, ensuring that each
unique reference answer is preserved only once.</p>
      <p>We present some of the discarded questions in Table 11.
These samples were filtered using several RegEx patterns. We
refined the RegEx patterns by manually validating their
impact on a subset of 17 documents (646 QA samples).</p>
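      <p>To make the procedure concrete, the snippet below is an illustrative sketch of the de-duplication and RegEx-based discarding described above; the sample schema and the two patterns shown are hypothetical stand-ins, not the actual patterns used to build INDAQA.</p>
      <preformat>
import re

# Hypothetical discard patterns; the real ones were tuned on 17 documents.
DISCARD_PATTERNS = [
    re.compile(r"(?i)secondo il testo"),
    re.compile(r"(?i)nel capitolo \d+"),
]

def clean(samples):
    """samples: list of dicts with 'question' and 'answer' keys (assumed schema)."""
    seen, kept = set(), []
    for s in samples:
        # Keep paraphrased reformulations, but drop exact duplicates.
        key = (s["question"].strip().lower(), s["answer"].strip().lower())
        if key in seen:
            continue
        # Discard questions matching any filtering pattern.
        if any(p.search(s["question"]) for p in DISCARD_PATTERNS):
            continue
        seen.add(key)
        kept.append(s)
    return kept
</preformat>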
      <p>Finally, we also show the prompts used to generate these samples in Table 12. To ensure uniqueness, all QA pairs for each book were generated in a single inference step and were later deduplicated. This process was repeated three times with different answer length requirements.</p>
      <sec id="sec-9-1">
        <title>Question type</title>
        <p>Cosa
Chi
Quale/i
Come/In che modo
Dove
Perché
Quanto/a/i/e
Quando
MISCELLANEA</p>
      </sec>
      <sec id="sec-9-2">
        <title>Transl.</title>
        <p>What
Who
Which
How
Where
Why
How much
When
OTHER
Count</p>
        <sec id="sec-9-2-1">
          <title>Moreover, the last two cases mostly require the model to</title>
          <p>reproduce verbatim one of the choices, which is
significantly diferent from the open-ended QA task.</p>
          <p>After automatic and manual inspection, we found that
ference step and were later deduplicated. This process the majority of samples in the Language Capability
catewas repeated three times with diferent answer length gory sufer from these structural limitations, with many
requirements. instances exhibiting multiple concurrent issues, resulting
in the need for heavy modifications to be adopted. While
Length Distribution of INDAQA compared to NarrativeQA Datasets such characteristics are appropriate for multiple-choice</p>
          <p>NQA Dataset QA frameworks, they present significant challenges for
50 INNQDAAQavAeDraagtea:se4t7k tokens generative QA tasks. Consequently, we excluded all
Lan</p>
          <p>INDAQA average: 27k tokens guage Capability samples from our experiments,
result40 Truncation point: 16k tokens ing in ITALIC-GEN containing exclusively instances from
y the Culture and Commonsense category.
cen30 We set up a pipeline to check and modify the
remainreuq ing samples to ensure compatibility with the generative
F20 QA setting. First, we employ Gemini-2.0-Flash to
reformat statements not ending with a question mark (?)
10 into proper interrogative form, standardizing the format
across all instances (issue number 1). We also require
0 the LLM to ensure proper coordination between question
0 50k 100k 1To5k0ekn Co2u0n0tk 250k 300k 350k and answer. Manual verification of the results identified
three instances that required correction where automatic
dFaigtausreet,2IN:HDAisQtoAg,raanmd sthhoewteinstgstehteofdNifearerrnacteivsebQeAtw(NeeQnAo)u.r reformatting failed to produce valid questions.</p>
          <p>Then, we filter the samples that would become
unanswerable without access to the multiple-choice options
(issue number 2) by first using a set of RegEx (both on
questions and correct choices), and then employing the
D. ITALIC-GEN LLM to classify samples based on the context provided
This Section provides additional details on the adaptation in the question alone. We applied this validation
proof the ITALIC dataset [19] from a multiple-choice format cess to the whole dataset, both original and reformatted
to a free-form generative QA setting. Such adaptations samples. During the initial inspection of the samples, we
must extend beyond simply extracting correct answers noted that the third issue predominantly afects samples
from the provided options, requiring systematic analysis in the Language Capability category. Since ITALIC-GEN
of the underlying sample characteristics and question exclusively comprises Culture and Commonsense
samtypes. ples, we did not implement additional filtering based on</p>
          <p>The original ITALIC dataset contains 10,000 instances this criterion. We do acknowledge that some instances
divided into two primary categories: Language Capabil- in ITALIC-GEN may present significant challenges for
ity and Culture and Commonsense. Due to the hetero- current generative QA systems.
geneous nature of the underlying data sources, not all
samples adhere to the standard question format.
Specifically:
2
2
3</p>
        </sec>
      </sec>
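      <p>The two-stage validation can be pictured with the following minimal sketch; the RegEx contents and the llm_is_self_contained helper (standing in for the Gemini-2.0-Flash call) are illustrative placeholders rather than the actual implementation.</p>
      <preformat>
import re

# Hypothetical patterns flagging questions that refer to the answer options.
OPTION_REFERENCES = [
    re.compile(r"(?i)quale delle seguenti"),
    re.compile(r"(?i)tra le opzioni"),
]

def llm_is_self_contained(question: str) -> bool:
    """Placeholder for the LLM classification step described above."""
    raise NotImplementedError

def keep_for_generative_qa(question: str, gold_choice: str) -> bool:
    # Stage 1: RegEx pass over both the question and the correct choice.
    if any(p.search(question) or p.search(gold_choice) for p in OPTION_REFERENCES):
        return False
    # Stage 2: the LLM judges whether the question alone gives enough context.
    return llm_is_self_contained(question)
</preformat>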
      <sec id="sec-9-3">
        <title>Question Choices</title>
        <p>"The Young Pope" è il titolo della 1) Kim Rossi Stuart 2) Christian De Sica 3) Roberto Benigni 4) Paolo
Sorserie ideata e diretta da: rentino
Con l’espressione "Schiafo di 1) Lo schiafo che Anagni diede a papa Bonifacio VIII 2) L’ofesa che Bonifacio</p>
      </sec>
      <sec id="sec-9-4">
        <title>Anagni" si è soliti indicare: VIII recò ad Anagni 3) L’oltraggio che subì papa Bonifacio VIII ad Anagni</title>
        <p>4)
Quale frase contiene un comple- 1) La ballerina aspettava con ansia il giorno del suo debutto 2) Sono andato
mento di compagnia? al lago con mia sorella per prendere il sole 3) Il medico garantisce che
con questa crema passerà il rossore 4) Con questa velocità non riuscirai mai a
finire il lavoro per domani
La frase "Sono felice" contiene: 1) un complemento oggetto 2) un complemento indiretto 3) un predicato
nominale D) un predicato verbale</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>