<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>What we Learned from Continually Training Minerva: a Case Study on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Bonomo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Giofré</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lu Xu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Fedele</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Colosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrei Stefan Bejgu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Scirè</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Babelscape</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza NLP Group, Dip. di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3</volume>
      <issue>16</issue>
      <abstract>
        <p>Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised finetuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully open-source LLM pretrained on a corpus composed of 50% Italian, we define and evaluate three distinct data recipes, comprising mathematical, encyclopedic, and copyrighted content, spanning both Italian and English. We also investigate the effect of extending the model's context window during continual pretraining on its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less represented languages within an open scientific framework.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Italian</kwd>
        <kwd>Continual Pre-training</kwd>
        <kwd>Culturality</kwd>
        <kwd>Long Context</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>answering or summarization) or, more frequently, aim at
training general-purpose conversational models. This is
Modern Large Language Models (LLMs) are typically achieved by finetuning LLMs on hundreds of thousands
trained through a multi-stage process comprising pre- of conversations covering diverse domains. Through this
training, supervised fine-tuning (SFT), and preference process, models learn to follow instructions to perform
alignment. During pretraining, models are trained in an a wide range of tasks [7, 8, 9] and generate coherent
autoregressive manner to learn language in an unsuper- responses in dialogue-like interactions.
vised way, without requiring human-labeled data [1, 2]. While the overall LLM training pipeline has become
This phase allows models to acquire linguistic knowl- increasingly standardized, the role of curated data after
edge from large-scale, unstructured corpora. Recent ap- initial pretraining remains an active area of investigation
proaches [3, 4, 5, 6] structure the pretraining process into for further improving model capabilities. However, the
two steps. In the first, models are exposed to trillions of efects of continual training on curated data mixtures
reraw web-sourced tokens, with only a small portion of main poorly understood, particularly for less represented
high-quality content. In the second, training continues on languages such as Italian. To the best of our knowledge,
a curated set of high-quality language or domain-specific OLMo et al. [3] is the only work specifically addressing
texts, aiming to mitigate the impact of low-quality web the impact of data composition in an open-source setting;
content and extend the model’s exposure to up-to-date however, it is limited to the English language.
and informative content. In this work, we address this gap by systematically
in</p>
      <p>After the intensive pretraining phase—where LLMs are vestigating how incorporating high-quality data mixtures
trained solely on unlabeled data—models undergo super- during continual pretraining afects model performance
vised fine-tuning to adapt to real-world use cases. SFT on English- and Italian-language tasks. A particular
focan target either task-specific applications (e.g., question cus is placed on cultural knowledge evaluation, where
curated data is expected to play a crucial role in
enrichCLiC-it 2025: Eleventh Italian Conference on Computational Linguis- ing the model’s ability to answer questions about Italian
*tiCcso,rSreepstpeomnbdeirng24a—uth2o6,r.2025, Cagliari, Italy cultural content. To this end, we build on the Minerva-7B
$ moroni@diag.uniroma1.it (L. Moroni); base model [10], a fully open-source LLM pretrained on
bonomo@diag.uniroma1.it (T. Bonomo); giofre@diag.uniroma1.it a balanced corpus of Italian and English data (50% each),
(L. Giofré); xu@diag.uniroma1.it (L. Xu); fedele@babelscape.com which provides a suitable foundation for evaluating
bilin(D. Fedele); colosi@babelscape.com (L. Colosi); gual continual pretraining strategies.
(bAej.gSuc@irèb)a;bnealvscigalpi@e.cdoimag.(uAn.iSr.oBmeaj1g.uit);(Rsc.irNea@vibgalbi)elscape.com Specifically, we define three distinct high-quality data
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License recipes for continual pretraining, varying in data
dimenAttribution 4.0 International (CC BY 4.0).
sions and source types, using both Italian and English Context Length Manipulation. Large Language
texts. These include content rich in mathematical rea- Models are typically pretrained with a fixed maximum
soning, encyclopedic knowledge, and copyrighted books. context length, which limits the number of tokens they
Through ablation studies, we examine the individual con- can process in a single sequence. Recent work by Xiong
tribution of specific data sources—such as copyrighted et al. [12] demonstrates how expanding the context
material and mathematical content—on downstream per- length of Llama-2 models–from 4,096 to 32,728 tokens–
formance across English and Italian benchmarks. can improve performance on long-context tasks. A
crit</p>
      <p>Additionally, we explore the efect of extending the ical aspect of long-context training is the choice of
pomodel’s maximum context length during continual pre- sitional encoding. Most modern LLMs employ Rotary
training, aiming to assess its impact on long-context un- Positional Embeddings (RoPE) [13], which encode token
derstanding. After pretraining, we instruction-tune the positions by rotating the query and key vectors in
attenvarious model variants using a bilingual (English and tion layers. This approach maintains relative positional
Italian) instruction-following dataset to evaluate their information and can be adapted for longer sequences.
performance in conversational settings. Recent studies show that modifying the RoPE base
fre</p>
      <p>Finally, to properly evaluate the influence of longer quency during continual pretraining enables models to
context and data composition, we introduce INDAQA, a handle longer contexts and even extrapolate beyond the
novel Italian benchmark for narrative question answering trained sequence lengths [14, 15]. Building on these
find(Section 6.1). Using INDAQA, we demonstrate the bene- ings, several recent LLMs have been released with
exifts of longer context windows and specific high-quality tended context capabilities. For example, Grattafiori et al.
data sources for complex language understanding tasks. [4] increases the context length of Llama-3 models from
8,192 to 128,000 tokens in the final stages of pretraining.</p>
      <p>Similarly, the Qwen model family [16] mostly supports
2. Related Work contexts up to 32,000 tokens. However, despite these
advancements, to the best of our knowledge, this paper
is the first that systematically investigates the impact of
context length manipulation on Italian-language tasks.</p>
      <sec id="sec-1-1">
        <title>Continual Training</title>
        <p>Following the initial pretraining phase over trillions of tokens, it is now common practice
to introduce high-quality data in a subsequent training
stage to further enhance LLM performance and steer the
model’s distribution toward more controlled domains.
Recent research has increasingly focused on continual
pretraining as a practical and impactful approach. For
instance, OLMo et al. [3] and Grattafiori et al. [4]
introduce a mid-training stage that incorporates high-quality
datasets into the pretraining process, e.g. GSM8K
training set for mathematical reasoning. This stage is treated
as a continuation of the initial training, employing an
annealing learning rate that decays linearly to zero. This
approach has been shown to improve downstream
performance in tasks requiring structured reasoning and
encyclopedic knowledge recall.</p>
        <p>Continual training is also frequently employed to adapt
released open-weight LLMs to specific languages or
domains, thereby improving performance on targeted tasks.
Basile et al. [11] and others demonstrate that adapting
pretrained multilingual models to Italian using curated
high-quality data leads to significant improvements in
Italian-language benchmarks. Despite these advances,
there is still a lack of systematic studies that ablate and
isolate the specific contributions of different data mixing
strategies in the continual pretraining stage—particularly
for less represented languages like Italian. In our work,
we assess the impact of controlled data used in the
continual-pretraining stage, looking at their impact on
English and Italian performance.</p>
        <sec id="sec-1-1-1">
          <title>Evaluation of LLMs in Italian</title>
          <p>Several recent efforts aim to close the evaluation gap between English and
Italian for generative LLMs. One of the first initiatives,
Ita-Bench [17], combines translated benchmarks with
natively authored Italian tasks, focusing on instruction-following and question answering. Along the same lines,
Magnini et al. [18] reframes native Italian resources into
both multiple-choice and open-ended formats,
studying the role of prompting strategies. More recently,
ITALIC [19] introduces a multiple-choice question
answering dataset entirely written in Italian, covering
linguistic, cultural, and domain-specific knowledge. In
parallel, Puccetti et al. [20] adapts Invalsi assessments to
probe LLMs’ multi-domain abilities.</p>
          <p>Complementing these Italian-specific efforts, multilingual benchmarks have also emerged.
GlobalMMLU [21] extends MMLU to multiple languages via
professional translation and cultural adaptation, while
MultiLOKO [22] provides culturally grounded questions
authored directly in each target language, including
Italian. While these benchmarks cover a variety of linguistic
and cultural aspects, they primarily focus on short-form
tasks. Yet, many real-world scenarios, such as
narrative comprehension and document-level reasoning,
require models to process and integrate information across
longer contexts. However, evaluation resources in Italian
remain limited in this dimension. To fill this gap, we
introduce INDAQA (Section 6.1), the first narrative question
answering benchmark designed to evaluate long-context
comprehension in Italian.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>This work investigates the impact of continual training</title>
        <p>and the influence of diferent data sources on downstream
performance, with particular attention to copyrighted
material. Additionally, we aim to address a gap in the
literature regarding the efect of context length expansion
on performance in Italian.</p>
      <p>We focus on three key dimensions:
• Data recipes: we introduce three distinct recipes designed to evaluate the role of data composition during continual training.
• Context length: we describe how we adapt models to long-context scenarios, using a selected data mixture from the previous step.
• Instruction following: we examine the instruction-following capabilities developed on top of each training recipe.</p>
      <p>Table 1: data composition of Recipe-1, drawing on Benchmarks, Wikipedia, RedPajama, Fineweb-edu, Gutenberg, FLAN, and The Stack.</p>
      <sec id="sec-2-1">
        <title>3.1. Data Recipes for Wide Linguistic Coverage</title>
        <p>To evaluate the impact of various data sources on the continual training of an open-source LLM, namely the Minerva-7B base model, we define several data recipes, each representing a distinct mixture of training corpora. Table 1 presents the data composition for one such configuration, which we refer to as Recipe-1 (Recipe-1 corresponds to the continual pretraining data used in the first version of the released Minerva-7B). This recipe incorporates a diverse set of sources. For Italian, we include: the Italian Wikipedia (Hugging Face version, 2023 dump, Italian split; https://huggingface.co/datasets/wikimedia/wikipedia), an encyclopedic collection of text, RedPajama [23], a web-based collection, and Ita-Bench [17], a suite of Italian and English benchmarks for generative models (Italian training split). Regarding English, the dataset comprises: Wikipedia (English split), Ita-Bench (English training split), Fineweb-edu [24], a web-based collection, Project Gutenberg (https://huggingface.co/datasets/manu/project_gutenberg), which comprises public-domain books, and FLAN [25, 26, 27, 28, 29], which contains different instructions for mathematical and logical reasoning.</p>
        <p>Building on Recipe-1, we design two additional data mixtures, Recipe-2 and Recipe-3, to evaluate the impact of mathematical reasoning data and the inclusion of a large volume of copyrighted books. Table 2 shows the data composition for these two recipes. Starting from the foundation of Recipe-1, we replace the standard Wikipedia dump with a curated and cleaned version collected by us, updated to May 2024. We also expanded the dataset with additional sources. For Italian, we included the Wikisource collection of articles (https://huggingface.co/datasets/wikimedia/wikisource), Gazzetta Ufficiale (https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale), which contains legislative and administrative acts of the Italian State, and Project Gutenberg. For English, we incorporated subsets of the Dolmino-mix dataset, used in the continual training of OLMo-2 [3], specifically the MATH and StackExchange (SE) components. The key distinction between Recipe-2 and Recipe-3 is that Recipe-3 incorporates the Books3 dataset [30], which allows the impact of including closed-copyrighted book content to be quantified. Further details on our data preprocessing steps can be found in Appendix B.</p>
      </sec>
      <sec id="sec-2-1b">
        <title>3.2. Long-context Adaptation</title>
        <p>Recent studies demonstrate that continual pre-training can substantially extend the context length of LLMs [12, 31]. Based on previous work, and motivated by the lack of a proper assessment of context expansion in Italian, we carry out the context length expansion on Recipe-3, our continually pre-trained model described in Section 3.1. Following the methodology of Xiong et al. [12], we extend the maximum context length from 4,096 tokens (the original limit of Minerva-7B) to 16,384 tokens. This expansion requires adjusting the Rotary Position Embedding (RoPE) base frequency θ from 10,000 to 500,000 to accommodate the increased sequence length. To establish baseline comparisons, we adjust the RoPE base frequency in our continually-trained models obtained through the recipes of Section 3.1 in order to adapt them to longer contexts.</p>
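        <p>For illustration, the sketch below expresses such a RoPE base-frequency change with the Hugging Face transformers API for a Llama/Mistral-style checkpoint; the checkpoint identifier and field names are assumptions for this example, not the authors' released configuration.</p>
        <preformat>
# Minimal sketch of the context-extension setup described above (assumed
# checkpoint id; any Llama/Mistral-style config exposing rope_theta works).
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "sapienzanlp/Minerva-7B-base-v1.0"  # assumed identifier

config = AutoConfig.from_pretrained(checkpoint)
config.rope_theta = 500_000              # RoPE base frequency: 10,000 to 500,000
config.max_position_embeddings = 16_384  # context window: 4,096 to 16,384 tokens

model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config)
# Continual pre-training on 16,384-token sequences then proceeds as usual; for
# the training-free baselines, only rope_theta is raised (e.g. to 100,000) at
# inference time, without any further training.
        </preformat>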
      </sec>
      <sec id="sec-2-2">
        <title>1Recipe-1 corresponds to the continual pretraining data used in the</title>
        <p>ifrst version of the released Minerva-7B.
2https://huggingface.co/datasets/wikimedia/wikipedia
3https://huggingface.co/datasets/manu/project_gutenberg</p>
      </sec>
      <sec id="sec-2-3">
        <title>4https://huggingface.co/datasets/wikimedia/wikisource 5https://huggingface.co/datasets/mii-llm/gazzetta-uficiale</title>
        <sec id="sec-2-3-1">
          <title>Dataset</title>
          <p>TÜLU-v3
LIMA
WildChat-IT
TowerBlocks-v0.2
GPT-4o-ITA-Instruct
Aya</p>
          <p>EN
IT/EN</p>
          <p>IT
IT/EN</p>
          <p>IT
IT</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental setup</title>
      <p>4.1. Continual training</p>
      <sec id="sec-3-1">
        <title>4.2. Instruction finetuning</title>
        <p>After continual pre-training, each recipe is converted into an instruct model through an SFT stage on the dialogue mixture summarised in Table 3. We base the mixture on TÜLU-v3 [9], a popular open-source 940K-conversation corpus covering 85 task families (reasoning, code, function calling, safety, tool use, etc.), mined from public APIs and manually filtered for policy compliance, which provides the broad, structured competence expected of modern assistants. To inject high-signal, stylistically polished examples, we add the 1,000-turn LIMA dataset [8] and its Italian counterpart LIMA-IT, produced by us by translating every prompt/response pair with GPT-4o-mini under a fidelity-preserving prompt; this gives the model a high-quality set of concise, helpful dialogues in both languages. We expand our selection with additional Italian-centric datasets: i) WildChat-IT, consisting of 5K informal prompts; ii) TowerBlocks-v0.2, containing 7K bilingual Italian-English public-service Q&amp;A pairs; iii) GPT-4o-ITA-Instruct, with 15K high-quality synthetic chain-of-thought examples; and iv) Aya, which includes 700 role-play and reasoning turns, specifically targeting colloquial language, public administration knowledge, and culturally grounded reasoning.</p>
        <p>Supervised fine-tuning was carried out with the LLAMA-Factory toolkit (https://github.com/hiyouga/Llama-Factory), which supports several conversation templates and provides utilities for efficient data parallelization. We fine-tuned the full Minerva-7B weights (no LoRA/adapters) in bfloat16 mixed precision. Training lasted two epochs with a peak learning rate of 1 × 10⁻⁶ scheduled by cosine decay after a 10% warm-up, and AdamW as the optimizer. We used an effective batch of 64 sequences (≈ 128 tokens). All models were trained with a 4,096-token context window, except the long-context variant of Recipe-3, which retained its 16,384-token window. End-to-end, each recipe consumed about 210 GPU-hours (240 for the long-context run). Detailed timing and CO2 estimates are shown in Appendix A.</p>
        <p>6 https://github.com/mosaicml/llm-foundry
7 https://www.hpc.cineca.it/systems/hardware/leonardo/</p>
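        <p>As a rough illustration of these hyperparameters, the following sketch restates them with Hugging Face TrainingArguments; the authors used the LLAMA-Factory toolkit, so this is an equivalent re-statement under assumed argument names, not their actual configuration file.</p>
        <preformat>
# Hedged sketch of the reported SFT hyperparameters (illustrative only).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="minerva-7b-sft",
    num_train_epochs=2,
    learning_rate=1e-6,              # peak learning rate
    lr_scheduler_type="cosine",      # cosine decay
    warmup_ratio=0.1,                # 10% warm-up
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # bfloat16 mixed precision
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,  # effective batch of 64 sequences (assumed split)
)
        </preformat>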
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Evaluation</title>
      <p>5.1. Language Modeling by Genre</p>
      <p>To evaluate the impact of the different data recipes, we analyze the perplexity scores of the trained LLMs on held-out data from various genres. Specifically, we test the models on three distinct genres: Books, Wikipedia, and News. The Books set consists of 51 held-out books selected from Books3 [30], covering 25 different genres, in English. The Wikipedia set includes 50 Italian pages from a 2025 snapshot (we process the May 1st, 2025 Wikipedia dump by first discarding pages with fewer than 500 tokens and then sampling uniformly at random from the resulting set), excluded from the training data used in all recipes. The News set consists of 200 Italian newspaper articles we independently collected from 2025 publications, ensuring they were never seen during any training step. Table 4 reports the language modeling performance, measured by perplexity, across these domains for each trained model.</p>
      <p>Regarding Books, incorporating Books3 into the training mix significantly lowers perplexity, as seen in the improved performance of Recipe-3. This indicates that including in-domain book content enhances generalization to literary-style text. Additionally, testing Recipe-3-16K with 16k context on Books drops the perplexity to 8.98, further improving modeling on extended sequences. For the Wikipedia genre, all three recipes outperform the original pretrained model, demonstrating improved ability to model high-quality encyclopedic text. Notably, Recipe-2 and Recipe-3 achieve the lowest perplexity, suggesting benefits from training on more recent and cleaner Wikipedia texts. In contrast, for the News genre, perplexity differences among the recipes are minimal (±0.20), indicating a limited impact of the training data variations on this domain. Interestingly, the base model achieves the lowest perplexity.</p>
      <p>Bottom line: The modeling of literary-style texts and Wikipedia articles is influenced by the choice of continual pretraining strategies, whereas News articles show no differences.</p>
      <p>5.2. Multi-Choice Question Answering</p>
      <p>To properly assess how different continual pretraining recipes influence LLM capabilities, we evaluate our trained models on a range of Italian-language benchmarks. In this Section, we focus exclusively on the continually-trained models, before applying any instruction tuning. This approach isolates the effects of continual pretraining and avoids biases introduced by SFT data. We conduct evaluations using the LM-Evaluation-Harness [33] library, leveraging the multi-choice format: a model's next-token prediction is used to assess its QA ability.</p>
      <p>We evaluate the models using ITA-Bench [17], selecting a diverse set of tasks from the benchmark: AMI (Misogyny Detection), GhigliottinAI (GH; a culturally grounded game), NERMUD (Named Entity Recognition), Prelearn (PL; Prerequisite Learning), ARC (Scientific Reasoning), BoolQ (BQ; Boolean Questions), GSM8K (Mathematics), HellaSwag (HS; Textual Entailment), MMLU (Multi-domain QA), PIQA (Physical Interaction QA), and SCIQ (Science Questions). For AMI, GhigliottinAI, and NERMUD, we use ITA-Bench's cloze-style evaluation format.</p>
      <p>Table 5 shows that all continual pretraining recipes consistently improve over the pretrained model, with an average gain of approximately +5.0 points. This result reinforces the importance of continual pretraining on high-quality (e.g., Wikipedia, Fineweb-edu) and synthetic datasets (e.g., FLAN, the Dolmino-MATH subset). Notably, MMLU exhibits substantial improvements across all recipes (≈ +15 points), highlighting strong generalization on multi-domain QA tasks. The best average performance is achieved by Recipe-2 and the long-context variant of Recipe-3. Recipe-1 underperforms, particularly on math-related benchmarks such as ARC and GSM8K, indicating the critical role of domain-specific data (e.g., Dolmino-MATH) in boosting model capabilities.</p>
      <p>Bottom line: Continual pretraining consistently boosts downstream performance; mathematical data improves STEM QA, while copyrighted books have minimal impact.</p>
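      <p>The sketch below shows how such a multiple-choice evaluation can be launched through the LM-Evaluation-Harness Python API (version 0.4+); the task names listed here are standard harness tasks used purely as an illustration, not the exact ITA-Bench task identifiers.</p>
      <preformat>
# Hedged sketch of a log-likelihood-based multiple-choice evaluation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sapienzanlp/Minerva-7B-base-v1.0,dtype=bfloat16",
    tasks=["mmlu", "arc_challenge", "hellaswag"],  # illustrative task names
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
      </preformat>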
      <sec id="sec-4-1">
        <title>9We process the May 1st, 2025 Wikipedia dump by first discarding</title>
        <p>pages with fewer than 500 tokens, and then sampling uniformly at
random from the resulting set.</p>
        <p>Bottom line: Continual pretraining consistently boosts
downstream performance; mathematical data improves</p>
        <p>STEM QA, while copyrighted books have minimal impact.
5.3. Mathematical Evaluation
To assess the impact of diferent continual-pretraining Recipe-1 2.48 14.70
recipes on math capabilities, we rely on two widely Recipe-2 9.57 34.42
used English mathematical benchmarks: GSM8k [34] and Recipe-3 8.96 26.45
MATH [35]. The former contains grade school math word Recipe-316K 10.26 32.29
problems, while the latter comprises challenging compe- Minerva Instruct Models
tition mathematics problems. We evaluate our models Recipe-1 10.14 24.63
using the LM-Evaluation-Harness [33], using its im- Recipe-2 12.84 42.45
plementations of both benchmarks. For GSM8k, we adopt Recipe-3 13.00 37.98
an 8-shot Chain-of-Thought prompting setup, while for Recipe-316K 12.82 40.25
MATH, we follow the Minerva-MATH [36] protocol, us- Italian-specific Models
ing 4-shot Chain-of-Thought prompting. Both
benchmarks use the generate_until setup, with model out- AONccIiTgAlo-8t-B7b 1170..5866 6409..6858
puts evaluated via post-processing for accuracy. We
compare our recipes to diferent open-source Italian (occiglot- English-first Models
7b-it-en-instruct10, ANITA-8B [37]) and multilingual Llama-3.1-8B 41.94 80.66
(Llama-3.1-8B [4], Mistral-7B [38], Qwen3-8B [39]) mod- Mistral-7B-v0.3 13.92 53.22
els, all in the same parameter range. Qwen3-8B 65.00 87.86</p>
        <p>Table 6 presents the results of tested models, with our Table 6
four continually pre-trained Minerva models evaluated Mathematical evaluation results on diferent Minerva
conboth before and after instruction tuning. On GSM8k, tinual pre-training recipes (before and after instruction
fineRecipe-2 achieves the highest accuracy in both settings, tuning) and State-of-the-Art models on Minerva-MATH
(4followed by Recipe-3, while Recipe-1 consistently un- shot) with sub-categories, and GSM8k (8-shot).
derperforms. Instruction tuning yields consistent
improvements across all recipes, reinforcing the overall
ranking and demonstrating its positive efect. These Bottom line: Continual pretraining on mathematical
ifndings suggest that incorporating mathematical data, data consistently improves accuracy on math problems.
such as Dolmino-MATH, during continual pre-training Instruction tuning on TULU-v3 helps mitigate the
shortplays a significant role in enhancing mathematical rea- comings of Recipe-1 on the MATH benchmark.
soning. For the MATH dataset, Recipes 2 and 3
outperform Recipe-1 in the base (pre-instruction tuning) setting, 5.4. Cultural Evaluation
particularly benefiting from long-context capabilities.
Interestingly, after instruction tuning, the performance gap We assess the impact of our recipes used during continual
narrows, with Recipe-1 becoming more competitive. pre-training by leveraging the Italian part of the
Multi</p>
        <p>When comparing Minerva models to state-of-the-art loko [22] dataset (250 instances), which provides
quessystems on GSM8k, they lag behind closed-data models tions on cultural content along with multiple acceptable
in both Italian and English. On the MATH dataset, Min- answers. We then compare our continually pre-trained
erva is comparable to Occiglot and Mistral, two closed- and instruction finetuned Minerva models to other Italian
data models, but still lags behind top-performing English- and English models, as in the previous section.
centric systems. This highlights the perfomance gap that According to the results in Table 7, Recipe-1 is the
Italian open-data LLMs must bridge. best performing model, both in Zero- and Few-Shot
settings, surpassing both the Italian-specific and the
English10https://huggingface.co/occiglot/occiglot-7b-it-en-instruct centric counterparts.</p>
      </sec>
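      <p>For reference, METEOR can be computed against multiple references with NLTK as sketched below; this illustrates the metric used in the open-ended evaluation, not the authors' exact scoring pipeline.</p>
      <preformat>
# Minimal METEOR example with one hypothesis and two tokenized references.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

references = [
    "Il protagonista torna nel suo paese natale.".split(),
    "Ritorna al paese dove era nato.".split(),
]
hypothesis = "Il protagonista ritorna nel paese in cui era nato.".split()

print(meteor_score(references, hypothesis))
      </preformat>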
    </sec>
    <sec id="sec-5">
      <title>6. Long-context Evaluation on Narrative Text</title>
      <p>To evaluate the long-context capabilities of our model,
we focus on narrative question answering, a task that
requires the processing and understanding of
extensive narrative text in order to answer questions.
NarrativeQA [41], a widespread benchmark for this task, was
constructed in English, which limits its use for the
evaluation of long-context performance in other languages.
To address this limitation, we introduce INDAQA
(Section 6.1), a novel benchmark for Italian narrative question
answering, and, to the best of our knowledge, the first
narrative question answering dataset in Italian. We
describe the evaluation setup for base and instruction-tuned
models on both NarrativeQA and INDAQA in Section 6.2
and report the results in Section 6.3.
6.1. INDAQA - Italian Narrative DAtaset for Question Answering</p>
      <sec id="sec-5-1">
        <title>6.1. INDAQA - Italian Narrative DAtaset for Question Answering</title>
        <p>We start building the dataset from the Italian split of Echoes from Alexandria [42], collecting 365 (book, summary) pairs with full texts from Wikisource and summaries from Wikipedia. After manually verifying alignment and removing plot-unrelated content from the summaries, we prompt an LLM to generate 20 question-answer pairs per book using the following guidelines: (i) questions must be unique, (ii) questions must be clear, unambiguous, and answerable from the summary alone, and (iii) each question requires having two short, potentially different, reference answers.</p>
        <p>After gathering a large number of samples, we filter them through three sequential steps. First, we deduplicate questions, but rather than discarding duplicates entirely, we retain all unique answers as additional references for the remaining samples. We also preserve different reformulations of identical questions, as NarrativeQA contains similar variations. Second, we remove unanswerable questions, i.e., samples containing invalid responses such as "Information not present in the summary." Finally, we filter out meta-questions that focus on structural rather than plot elements (e.g., "What happens in chapter 3?" or "What is the title of the book?"). The last two filtering steps are carried out through a set of manually derived RegEx patterns. Examples of samples that were filtered out are showcased in Table 11 (Appendix).</p>
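        <p>The pattern lists themselves are not reproduced in the paper, so the sketch below only illustrates the kind of manually derived RegEx filters described above; the patterns and the helper function are hypothetical examples.</p>
        <preformat>
# Hypothetical filters for unanswerable responses and meta-questions.
import re

UNANSWERABLE = re.compile(r"informazion[ei] non (è |sono )?present[ei]",
                          re.IGNORECASE)
META_QUESTION = re.compile(r"(capitolo \d+|titolo del libro|quanti capitoli)",
                           re.IGNORECASE)

def keep_sample(question, answers):
    if any(UNANSWERABLE.search(a) for a in answers):
        return False   # invalid "information not present in the summary" answers
    if META_QUESTION.search(question):
        return False   # structural rather than plot-related questions
    return True
        </preformat>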
        <p>11We use Gemini-2.0-Flash and Gemini-2.0-Flash-Lite.</p>
        <sec id="sec-5-1-1">
          <title>Metric</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Avg. Length (Tokens) # Samples</title>
          <p>Question
1st Answer
2nd Answer
Question
1st Answer
2nd Answer
3rd Answer
4th Answer
5th Answer
ifltering steps on 17 documents (646 QA samples, 5% of
the dataset) spanning diverse summary lengths (18-1200
tokens). Each sample is annotated for acceptability using
the same criteria used for generation, yielding a 2.32%
error rate after filtering.</p>
          <p>Our final dataset, INDAQA, consists of texts with an
average length shorter than NarrativeQA (27k vs 47k
tokens) due to the prevalence of short stories and theatrical
plays.12 The size of the two datasets is comparable (365 vs
355 documents) with slightly more average QA samples
in INDAQA (37.83 vs 29.74). We also report the type of
questions in the dataset by analyzing the first few tokens
of the questions in Table 10 (Appendix). More details can
be found in Appendix C.</p>
        <p>6.2. Long-context Evaluation Setup</p>
        <p>Base-model evaluation. To evaluate the effectiveness of our long-context continual training approach, we compare Recipe-3-16K against Recipe-1, Recipe-2 and Recipe-3. Except for Recipe-3-16K, we adapt each model to process longer sequences by tuning the RoPE base frequency to θ = 100,000. We assess each model's ability to utilize extended context using an adapted version of NarrativeQA and INDAQA. Specifically, we truncate each text at varying target context lengths (4,096, 8,192, 16,384 and 32,768 tokens), and we record the minimum perplexity achieved by each model across the ground-truth answers when given the truncated text and the respective questions. We assume that models effectively processing long contexts will show lower perplexity on correct answers than those struggling with extended documents.</p>
        <p>Instruction-tuning evaluation. We evaluate the instruction-tuned versions of the Minerva continually pretrained models alongside various systems, as in the previous sections. Benchmarking is conducted on both NarrativeQA and INDAQA to assess real-world performance in English and Italian narrative question answering. We report METEOR [40] scores to measure answer quality against the reference responses. We truncate the book texts to 16,384 and 32,768 tokens to match our target context lengths, following the approach used in LongBench [43]. While some questions may require context that is excluded by this truncation, all models are affected equally, ensuring a fair comparison between them.</p>
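        <p>A sketch of the base-model probe described above is shown below: the perplexity of a gold answer is computed given the truncated book and its question, and the minimum over the reference answers is recorded; the prompt template and names are placeholders rather than the authors' exact formatting.</p>
        <preformat>
# Perplexity of a reference answer conditioned on the truncated book (sketch).
import math
import torch

def answer_perplexity(model, tokenizer, book, question, answer, context_len=16_384):
    book_ids = tokenizer(book, truncation=True, max_length=context_len).input_ids
    prompt = tokenizer.decode(book_ids) + "\n\nDomanda: " + question + "\nRisposta: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt",
                           add_special_tokens=False).input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    labels = ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # score only the answer tokens
    with torch.no_grad():
        loss = model(ids, labels=labels).loss
    return math.exp(loss.item())

# Reported value: min(answer_perplexity(m, t, book, q, a) for a in references)
        </preformat>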
        <p>6.3. Results</p>
        <p>In Figure 1 we present the results of our base-model evaluation. Our long-context adaptation of Recipe-3 clearly enables the model to achieve a lower perplexity on the answers of NarrativeQA and INDAQA at all context lengths tested, indicating an effective adaptation to long data. It is especially interesting to note the results at 32,768 tokens: adapting models continually trained with shorter context lengths through RoPE frequency tuning is not enough to avoid huge spikes in perplexity, while Recipe-3-16K is able to effectively model text at double its continual-training context window.</p>
        <p>12 In our experiments, the input text is always truncated at 16k tokens.</p>
        <p>Table 9 presents the results of the evaluation of our instruction-tuned models. As expected, Recipe-3-16K achieves higher results in all settings, surpassing Recipe-1 in all experiments with books truncated to 16k tokens by 7.7 points on NarrativeQA and 8.6 on INDAQA. The difference is even larger when we extend the truncation of books to 32K tokens, with Recipe-3-16K achieving 17.3 and 14.9 more METEOR points on NarrativeQA and INDAQA, respectively.</p>
        <p>Minerva models perform comparably to other models of the same size, both Italian-specific (occiglot-7b-it-en-instruct, ANITA-8B [37]) and multilingual (Llama-3.1-8B [4], Mistral-7B [38]). On NarrativeQA, the Recipe-3-16K variant achieves a METEOR score of 21.4 and 20.5 at a context length of 16K and 32K respectively, ranking behind Llama-3.1 and Mistral-v0.3. In contrast, the Minerva model continually pre-trained with Recipe-3-16K outperforms all tested models on INDAQA at 16K tokens of context, achieving the highest METEOR score of 25.9. At 32K tokens of context, it ranks second only to Llama-3.1 and Mistral-v0.3, scoring 3.3 and 1.7 points lower respectively on the METEOR metric. This performance gap is expected, given that Recipe-3-16K's continual training was conducted at half the context length (16K tokens).</p>
        <p>Table 9: Continual pre-training recipe evaluation on NarrativeQA and INDAQA after instruction fine-tuning. M@16k and M@32k denote METEOR scores with 16,384 and 32,768 token book contexts. Bold scores indicate best overall performance; underlined scores indicate the best Italian-specific model. INDAQA baseline rows (context window, M@16k / M@32k): occiglot-7b (32K) 19.9 / 19.9; ANITA-8B (8K) 7.5 / 7.0; Llama-3.1-8B (128K) 24.9 / 29.3; Mistral-7B-v0.3 (32K) 22.5 / 27.7.</p>
        <p>Bottom line: Extending context length to 16K tokens via continual pre-training improves modeling capabilities over training-free methods and enhances robustness at 32K tokens. Recipe-3-16K achieves strong narrative QA performance in both English and Italian, outperforming Italian-specific models and matching English-first LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <sec id="sec-6-1">
        <p>This work explores the impact of data mixing strategies and long-context expansion on Italian language modeling. We conduct continual pretraining using three distinct data recipes and apply a unified instruction-following fine-tuning approach to all resulting models. Our evaluation assesses language modeling capabilities on genre-specific data, highlighting that copyrighted books included in the training recipes reduce perplexity on literary texts. We benchmark the proposed continual pre-training recipes across several multi-domain tasks, with a focus on mathematical reasoning, demonstrating that genre-specific data, such as mathematical texts and high-quality web content, contribute to overall performance improvements, whereas copyrighted books do not consistently offer the same benefit. We also investigate cultural alignment, finding that English datasets, such as mathematical texts and English-copyrighted books, can negatively impact performance on culturally-aware Italian-specific tasks. Additionally, our ITALIC-GEN adaptation offers a complementary perspective on cultural evaluation, uncovering encouraging results for Italian LLMs. Lastly, we evaluate long-context capabilities through narrative question answering in both English and Italian. Due to the absence of an Italian benchmark, we introduced INDAQA, a new dataset for Italian narrative QA, and show that extending the context length of a model consistently improves its downstream performance on narrative QA.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgments</title>
        <p>Luca Moroni, Tommaso Bonomo, Luca Giofré and Lu Xu gratefully acknowledge the support of the AI
Factory IT4LIA project. Roberto Navigli acknowledges the
support of the PNRR MUR project PE0000013-FAIR. We
acknowledge ISCRA for awarding this project access to
the LEONARDO supercomputer, owned by the EuroHPC
Joint Undertaking, hosted by CINECA (Italy).
</p>
      </sec>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>[1] J. Hoffmann, S. Borgeaud, A. Mensch, et al., Training compute-optimal large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), Curran Associates Inc., Red Hook, NY, USA, 2022.</p>
      <p>[2] D. Groeneveld, I. Beltagy, E. Walsh, et al., OLMo: Accelerating the science of language models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 15789-15809. https://aclanthology.org/2024.acl-long.841/</p>
      <p>[3] T. OLMo, P. Walsh, L. Soldaini, et al., 2 OLMo 2 Furious, 2025. arXiv:2501.00656.</p>
      <p>[4] A. Grattafiori, A. Dubey, A. Jauhri, et al., The Llama 3 herd of models, 2024. arXiv:2407.21783.</p>
      <p>[5] Y. Xie, K. Aggarwal, A. Ahmad, Efficient continual pre-training for building domain specific large language models, in: Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024, pp. 10184-10201. https://aclanthology.org/2024.findings-acl.606/</p>
      <p>[6] L. Moroni, G. Puccetti, P.-L. Huguet Cabot, et al., Optimizing LLMs for Italian: Reducing token fertility and enhancing efficiency through vocabulary adaptation, in: Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, 2025, pp. 6646-6660. https://aclanthology.org/2025.findings-naacl.371/</p>
      <p>[7] N. Ding, Y. Chen, B. Xu, et al., Enhancing chat language models by scaling high-quality instructional conversations, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023, pp. 3029-3051. https://aclanthology.org/2023.emnlp-main.183/</p>
      <p>[8] C. Zhou, P. Liu, P. Xu, et al., LIMA: Less is more for alignment, 2023. arXiv:2305.11206.</p>
      <p>[9] N. Lambert, J. Morrison, V. Pyatkin, et al., Tülu 3: Pushing frontiers in open language model post-training, 2025. arXiv:2411.15124.</p>
      <p>[10] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 707-719. https://aclanthology.org/2024.clicit-1.77/</p>
      <p>[11] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.</p>
      <p>[12] W. Xiong, J. Liu, I. Molybog, et al., Effective long-context scaling of foundation models, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 2024, pp. 4643-4663. https://aclanthology.org/2024.naacl-long.260/</p>
      <p>[13] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu, RoFormer: Enhanced transformer with rotary position embedding, Neurocomputing 568 (2024) 127063. doi:10.1016/j.neucom.2023.127063.</p>
      <p>[14] X. Liu, H. Yan, C. An, X. Qiu, D. Lin, Scaling laws of RoPE-based extrapolation, in: The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024. https://openreview.net/forum?id=JO7k0SJ5V6</p>
      <p>[15] Y. Wu, Y. Gu, X. Feng, et al., Extending context window of large language models from a distributional perspective, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, 2024, pp. 7288-7301. https://aclanthology.org/2024.emnlp-main.414/</p>
      <p>[16] Qwen: A. Yang, B. Yang, B. Zhang, et al., Qwen2.5 technical report, 2025. arXiv:2412.15115.</p>
      <p>[17] L. Moroni, S. Conia, F. Martelli, R. Navigli, Towards a more comprehensive evaluation for Italian LLMs, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 584-599. https://aclanthology.org/2024.clicit-1.67/</p>
      <p>[18] B. Magnini, R. Zanoli, M. Resta, et al., Evalita-LLM: Benchmarking large language models on Italian, 2025. arXiv:2502.02289.</p>
      <p>[19] A. Seveso, D. Potertì, E. Federici, M. Mezzanzanica, F. Mercorio, ITALIC: An Italian culture-aware natural language benchmark, in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, New Mexico, 2025, pp. 1469-1478. https://aclanthology.org/2025.naacl-long.68/</p>
      <p>[20] G. Puccetti, M. Cassese, A. Esuli, The Invalsi benchmarks: measuring the linguistic and mathematical understanding of large language models in Italian, in: Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 6782-6797. https://aclanthology.org/2025.coling-main.453/</p>
      <p>[21] S. Singh, A. Romanou, C. Fourrier, et al., Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2025. arXiv:2412.03304.</p>
      <p>[22] D. Hupkes, N. Bogoychev, MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages, 2025. arXiv:2504.10356.</p>
      <p>[23] M. Weber, D. Y. Fu, Q. Anthony, et al., RedPajama: an open dataset for training large language models, NeurIPS Datasets and Benchmarks Track, 2024.</p>
      <p>[24] A. Lozhkov, L. Ben Allal, L. von Werra, T. Wolf, FineWeb-Edu: the finest collection of educational content, 2024. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu, doi:10.57967/hf/2497.</p>
      <p>[25] B. Goodson, Fine FLAN: SeqIO to parquet so you don't have to, https://huggingface.co/datasets/Open-Orca/FLAN, 2023.</p>
      <p>[26] S. Longpre, L. Hou, T. Vu, et al., The Flan collection: Designing data and methods for effective instruction tuning, 2023. arXiv:2301.13688.</p>
      <p>[27] J. Wei, M. Bosma, V. Y. Zhao, et al., Finetuned language models are zero-shot learners, 2022. arXiv:2109.01652.</p>
      <p>[28] V. Sanh, A. Webson, C. Raffel, et al., Multitask prompted training enables zero-shot task generalization, 2022. arXiv:2110.08207.</p>
      <p>[29] Y. Wang, S. Mishra, P. Alipoormolabashi, et al., Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks, 2022. arXiv:2204.07705.</p>
      <p>[30] L. Gao, S. Biderman, S. Black, et al., The Pile: An 800GB dataset of diverse text for language modeling, 2020. arXiv:2101.00027.</p>
      <p>[31] Qwen Team, Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens, 2025. https://qwenlm.github.io/blog/qwen2.5-1m/</p>
      <p>[32] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. arXiv:1711.05101.</p>
      <p>[33] L. Gao, J. Tow, B. Abbasi, et al., The Language Model Evaluation Harness, 2024. https://zenodo.org/records/12608602, doi:10.5281/zenodo.12608602.</p>
      <p>[34] K. Cobbe, V. Kosaraju, M. Bavarian, et al., Training verifiers to solve math word problems, 2021. arXiv:2110.14168.</p>
      <p>[35] D. Hendrycks, C. Burns, S. Kadavath, et al., Measuring mathematical problem solving with the MATH dataset, NeurIPS, 2021.</p>
      <p>[36] A. Lewkowycz, A. Andreassen, D. Dohan, et al., Solving quantitative reasoning problems with language models, 2022. arXiv:2206.14858.</p>
      <p>[37] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.</p>
      <p>[38] A. Q. Jiang, A. Sablayrolles, A. Mensch, et al., Mistral 7B, 2023. arXiv:2310.06825.</p>
      <p>[39] A. Yang, et al., Qwen3 technical report, 2025. arXiv:2505.09388.</p>
      <p>[40] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, 2005, pp. 65-72. https://aclanthology.org/W05-0909/</p>
      <p>[41] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, Transactions of the Association for Computational Linguistics 6 (2018) 317-328. https://aclanthology.org/Q18-1023/</p>
      <p>[42] A. Scirè, S. Conia, S. Ciciliano, R. Navigli, Echoes from Alexandria: A large resource for multilingual book summarization, in: Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 2023, pp. 853-867. https://aclanthology.org/2023.findings-acl.54/</p>
      <p>[43] Y. Bai, X. Lv, J. Zhang, et al., LongBench: A bilingual, multitask benchmark for long context understanding, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 2024, pp. 3119-3137. https://aclanthology.org/2024.acl-long.172/</p>
    </sec>
    <sec id="sec-7">
      <title>A. Timing and CO2 Emissions Estimates</title>
      <p>To quantify both the computational effort and environmental footprint of our training and experiments, we compute energy and CO2 estimates assuming: an average GPU power draw of 300 W under full load, a data-center PUE (power usage effectiveness) of 1.2, and a grid emission factor of 0.28 kg CO2/kWh (typical for the European grid). The total energy consumed per GPU-hour is kWh/GPUh = 0.3 kW × 1.2 = 0.36 kWh/GPUh, and the CO2 emitted per GPU-hour is CO2/GPUh = 0.36 kWh × 0.28 kg/kWh ≈ 0.10 kg CO2/GPUh.</p>
      <p>We estimate that the continual training of the four recipes, Recipe-1 (3.5 days) and Recipes 2, 3, and 3-16K (7 days each), on 64 GPUs corresponds to a total GPU-time of ≈ 37,632 GPUh. Using an emission factor of 0.10 kg CO2/GPUh, this yields about 3.8 t CO2. With respect to the instruction tuning process, considering the same number of GPUs, the standard 4,096-token variant required approximately 3,000 GPU-hours, emitting roughly 3 t CO2, while the long-context 16,384-token variant ran for about double the time (6,000 GPU-hours), producing approximately 6 t CO2.</p>
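      <p>The continual-training estimate can be reproduced with the short calculation below, using the assumptions stated above.</p>
      <preformat>
# Back-of-the-envelope energy and CO2 estimate for continual training.
GPU_POWER_KW = 0.3            # average draw under full load
PUE = 1.2                     # data-center power usage effectiveness
GRID_KG_CO2_PER_KWH = 0.28    # typical European grid

kwh_per_gpuh = GPU_POWER_KW * PUE                      # 0.36 kWh/GPUh
kg_co2_per_gpuh = kwh_per_gpuh * GRID_KG_CO2_PER_KWH   # about 0.10 kg CO2/GPUh

gpu_hours = (3.5 + 3 * 7) * 24 * 64                    # four recipes on 64 GPUs
print(round(gpu_hours), round(gpu_hours * kg_co2_per_gpuh / 1000, 1))  # 37632, 3.8
      </preformat>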
    </sec>
    <sec id="sec-8">
      <title>Estimates</title>
      <sec id="sec-8-1">
        <title>RedPajama. We retrieved the RedPajama dataset</title>
        <p>from Hugging Face: https://huggingface.co/datasets/
togethercomputer/RedPajama-Data-V2. We performed
To quantify both the computational efort and environ- deduplication using the provided metadata and extracted
mental footprint of our training end experiments we com- the text from the ‘head‘ partition of each dump. For
pute energy and CO2 estimates assuming: Average GPU Recipe-1, we used the 2023-14 dump, while for Recipes 2
power draw: 300 W under full load. Data-center PUE and 3 we additionally used dumps 2023-06, 2022-49, and
(power usage efectiveness): 1.2. Grid emission factor: 2022-40. We filtered out texts with fewer than 500 words.
0.28 kg CO2/kWh (typical for the European grid). Gutenberg. We collected texts from Project Gutenberg
Total energy consumed per GPU-hour is via Hugging Face: https://huggingface.co/datasets/manu/
project_gutenberg.
kWh/GPUh = 0.3 kW × 1.2 = 0.36 kWh/GPUh, Fineweb-Edu. We used the Fineweb-Edu dataset from
Hugging Face: https://huggingface.co/datasets/
and CO2 emitted per GPU-hour is HuggingFaceFW/fineweb-edu, specifically the
sample-100BT branch. This is a random subset
CO2/GPUh = 0.36 kWh × 0.28 kWkgh aofmt hineimfuullmdaqtuasaelitt.yFsocroRreecoipfe3-.81;, fwoer Rseelceicpteesd2paagneds3w,withe
≈ 0.10 kg CO2/GPUh. applied a threshold of 4.0.</p>
        <p>Dolmino. The Dolmino data, specifically the math and
We estimate that the continual training of four recipes, stackexchange subsets, were obtained from: https://
Recipe 1 (3.5 days) and Recipes 2, 3, and 316k (7 days huggingface.co/datasets/allenai/dolmino-mix-1124.
each), on 64 GPUs corresponds to a total GPU-time of FLAN. We downloaded the FLAN dataset from https:
≈ 37 632 GPUh. //huggingface.co/datasets/allenai/dolma. We selected
Using an emission factor of 0.10 kg CO2/GPUh, this only the examples using the following prompt formats:
yields about 3.8 t CO2. fs_opt, fs_noopt, zs_opt, and zs_noopt.
With respect to the instruction tuning process, consider- The Stack. We collected data from the Stack
ing the same number of GPUs,the standard 4 096-token dataset at: https://huggingface.co/datasets/bigcode/
variant required approximately 3000 GPU-hours, emit- the-stack-v2-train-smol-ids. We included only
ting roughly 3 t CO2. The long-context 16 384-token code samples from the refs/heads/master and
variant ran for about double the time (6000 GPU-hours), refs/heads/main branches, and further filtered to
producing approximately 6 tons of CO2. include only repositories with at least 10 GitHub stars.
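      <p>The following short Python snippet is a minimal sketch that reproduces the back-of-the-envelope estimate above from the stated assumptions (300 W draw, PUE 1.2, 0.28 kg CO2/kWh); the variable names are ours and only illustrate the arithmetic.</p>
      <preformat>
POWER_KW = 0.3          # average GPU power draw under full load (300 W)
PUE = 1.2               # data-center power usage effectiveness
EMISSION_FACTOR = 0.28  # kg CO2 per kWh (typical European grid)

kwh_per_gpuh = POWER_KW * PUE                   # 0.36 kWh per GPU-hour
co2_per_gpuh = kwh_per_gpuh * EMISSION_FACTOR   # ~0.10 kg CO2 per GPU-hour

gpus = 64
days = 3.5 + 3 * 7                              # Recipe 1 plus Recipes 2, 3, 3-16k
total_gpuh = gpus * days * 24                   # 37,632 GPU-hours
total_co2_tonnes = total_gpuh * co2_per_gpuh / 1000
print(f"{total_gpuh:.0f} GPUh, about {total_co2_tonnes:.1f} t CO2")
</preformat>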
    </sec>
    <sec id="sec-8">
      <title>B. Data Processing</title>
      <p>This Section outlines the data processing steps applied to the various datasets used in the three main recipes described in Section 3.1.</p>
      <p>RedPajama. We retrieved the RedPajama dataset from Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2. We performed deduplication using the provided metadata and extracted the text from the ‘head’ partition of each dump. For Recipe-1, we used the 2023-14 dump, while for Recipes 2 and 3 we additionally used dumps 2023-06, 2022-49, and 2022-40. We filtered out texts with fewer than 500 words.</p>
      <p>Gutenberg. We collected texts from Project Gutenberg via Hugging Face: https://huggingface.co/datasets/manu/project_gutenberg.</p>
      <p>Fineweb-Edu. We used the Fineweb-Edu dataset from Hugging Face: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu, specifically the sample-100BT branch, which is a random subset of the full dataset. For Recipe-1, we selected pages with a minimum quality score of 3.8; for Recipes 2 and 3, we applied a threshold of 4.0.</p>
      <p>Dolmino. The Dolmino data, specifically the math and stackexchange subsets, were obtained from: https://huggingface.co/datasets/allenai/dolmino-mix-1124.</p>
      <p>FLAN. We downloaded the FLAN dataset from https://huggingface.co/datasets/allenai/dolma. We selected only the examples using the following prompt formats: fs_opt, fs_noopt, zs_opt, and zs_noopt.</p>
      <p>The Stack. We collected data from the Stack dataset at: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids. We included only code samples from the refs/heads/master and refs/heads/main branches, and further filtered to include only repositories with at least 10 GitHub stars.</p>
      <p>Books3. We used a previously obtained copy of the Books3 dataset, which is no longer publicly available for download.</p>
      <sec id="sec-8-2">
        <title>This Section outlines the data processing steps applied to the various datasets used in the three main recipes described in Section 3.1.</title>
      <p>Benchmarks. We utilized the translated benchmarks from ITA-Bench [17], specifically leveraging the training sets (when available) from both the original and translated versions. We formatted these using prompts defined consistently with LM-Evaluation-Harness [33].</p>
        <p>Wikisource. We downloaded the Hugging Face
version of the Wikisource dataset, available at: https://
huggingface.co/datasets/wikimedia/wikisource.
Gazzetta Ufficiale. We downloaded the Hugging Face version of the Gazzetta Ufficiale dataset, available at: https://huggingface.co/datasets/mii-llm/gazzetta-ufficiale.</p>
        <p>Wikipedia. For Recipe-1, we used the Hugging Face
version of the Wikipedia dataset, available at: https://
huggingface.co/datasets/wikimedia/wikipedia. For Recipes 2 and 3, we used an updated version, collected and processed by us, with pages created up to 2024.</p>
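      <p>As an illustration of the length- and quality-based filtering applied to RedPajama and Fineweb-Edu above, the following minimal sketch uses the Hugging Face datasets library; the column names ("text", "score") and the exact loading arguments are assumptions for illustration rather than the pipeline actually used.</p>
      <preformat>
from datasets import load_dataset

def long_enough(example, min_words=500):
    # RedPajama-style length filter: keep documents with at least 500 words.
    return len(example["text"].split()) >= min_words

def educational(example, threshold=3.8):
    # Fineweb-Edu-style quality filter; "score" is an assumed column name.
    return example["score"] >= threshold

# Stream the sample-100BT subset rather than downloading it in full.
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-100BT",
                       split="train", streaming=True)

recipe1_pages = fineweb.filter(educational)                        # threshold 3.8
recipe23_pages = fineweb.filter(lambda ex: educational(ex, 4.0))   # threshold 4.0
</preformat>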
    </sec>
    <sec id="sec-9">
      <title>C. INDAQA</title>
      <p>In this Section, we present additional details on the
dataset we built, INDAQA. We retain samples asking
the same questions with different formulations,
following the approach in NarrativeQA. This design choice
preserves valuable linguistic variation that may prove
instrumental for future analyses examining the effects
of question reformulation on QA system performance.
While we maintain paraphrased questions, we eliminate
exact duplicates from the dataset, ensuring that each
unique reference answer is preserved only once.</p>
      <p>We present some of the discarded questions in Table 11.
These samples were filtered using several RegEx patterns. We
refined the RegEx patterns by manually validating their
impact on a subset of 17 documents (646 QA samples).</p>
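      <p>To make the procedure concrete, the snippet below is an illustrative sketch of the de-duplication and RegEx-based discarding described above; the sample schema and the two patterns shown are hypothetical stand-ins, not the actual patterns used to build INDAQA.</p>
      <preformat>
import re

# Hypothetical discard patterns; the real ones were tuned on 17 documents.
DISCARD_PATTERNS = [
    re.compile(r"(?i)secondo il testo"),
    re.compile(r"(?i)nel capitolo \d+"),
]

def clean(samples):
    """samples: list of dicts with 'question' and 'answer' keys (assumed schema)."""
    seen, kept = set(), []
    for s in samples:
        # Keep paraphrased reformulations, but drop exact duplicates.
        key = (s["question"].strip().lower(), s["answer"].strip().lower())
        if key in seen:
            continue
        # Discard questions matching any filtering pattern.
        if any(p.search(s["question"]) for p in DISCARD_PATTERNS):
            continue
        seen.add(key)
        kept.append(s)
    return kept
</preformat>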
      <p>Finally, we also show the prompts used to generate these samples in Table 12. To ensure uniqueness, all QA pairs for each book were generated in a single inference step and were later deduplicated. This process was repeated three times with different answer length requirements.</p>
      <sec id="sec-9-1">
        <title>Question type</title>
        <p>Cosa
Chi
Quale/i
Come/In che modo
Dove
Perché
Quanto/a/i/e
Quando
MISCELLANEA</p>
      </sec>
      <sec id="sec-9-2">
        <title>Transl.</title>
        <p>What
Who
Which
How
Where
Why
How much
When
OTHER
Count</p>
        <sec id="sec-9-2-1">
          <title>Moreover, the last two cases mostly require the model to</title>
          <p>reproduce verbatim one of the choices, which is
significantly diferent from the open-ended QA task.</p>
          <p>After automatic and manual inspection, we found that
ference step and were later deduplicated. This process the majority of samples in the Language Capability
catewas repeated three times with diferent answer length gory sufer from these structural limitations, with many
requirements. instances exhibiting multiple concurrent issues, resulting
in the need for heavy modifications to be adopted. While
Length Distribution of INDAQA compared to NarrativeQA Datasets such characteristics are appropriate for multiple-choice</p>
          <p>NQA Dataset QA frameworks, they present significant challenges for
50 INNQDAAQavAeDraagtea:se4t7k tokens generative QA tasks. Consequently, we excluded all
Lan</p>
          <p>INDAQA average: 27k tokens guage Capability samples from our experiments,
result40 Truncation point: 16k tokens ing in ITALIC-GEN containing exclusively instances from
y the Culture and Commonsense category.
cen30 We set up a pipeline to check and modify the
remainreuq ing samples to ensure compatibility with the generative
F20 QA setting. First, we employ Gemini-2.0-Flash to
reformat statements not ending with a question mark (?)
10 into proper interrogative form, standardizing the format
across all instances (issue number 1). We also require
0 the LLM to ensure proper coordination between question
0 50k 100k 1To5k0ekn Co2u0n0tk 250k 300k 350k and answer. Manual verification of the results identified
three instances that required correction where automatic
dFaigtausreet,2IN:HDAisQtoAg,raanmd sthhoewteinstgstehteofdNifearerrnacteivsebQeAtw(NeeQnAo)u.r reformatting failed to produce valid questions.</p>
          <p>Then, we filter the samples that would become
unanswerable without access to the multiple-choice options
(issue number 2) by first using a set of RegEx (both on
questions and correct choices), and then employing the
D. ITALIC-GEN LLM to classify samples based on the context provided
This Section provides additional details on the adaptation in the question alone. We applied this validation
proof the ITALIC dataset [19] from a multiple-choice format cess to the whole dataset, both original and reformatted
to a free-form generative QA setting. Such adaptations samples. During the initial inspection of the samples, we
must extend beyond simply extracting correct answers noted that the third issue predominantly afects samples
from the provided options, requiring systematic analysis in the Language Capability category. Since ITALIC-GEN
of the underlying sample characteristics and question exclusively comprises Culture and Commonsense
samtypes. ples, we did not implement additional filtering based on</p>
          <p>The original ITALIC dataset contains 10,000 instances this criterion. We do acknowledge that some instances
divided into two primary categories: Language Capabil- in ITALIC-GEN may present significant challenges for
ity and Culture and Commonsense. Due to the hetero- current generative QA systems.
geneous nature of the underlying data sources, not all
samples adhere to the standard question format.
Specifically:
2
2
3</p>
        </sec>
      </sec>
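      <p>The two-stage validation can be pictured with the following minimal sketch; the RegEx contents and the llm_is_self_contained helper (standing in for the Gemini-2.0-Flash call) are illustrative placeholders rather than the actual implementation.</p>
      <preformat>
import re

# Hypothetical patterns flagging questions that refer to the answer options.
OPTION_REFERENCES = [
    re.compile(r"(?i)quale delle seguenti"),
    re.compile(r"(?i)tra le opzioni"),
]

def llm_is_self_contained(question: str) -> bool:
    """Placeholder for the LLM classification step described above."""
    raise NotImplementedError

def keep_for_generative_qa(question: str, gold_choice: str) -> bool:
    # Stage 1: RegEx pass over both the question and the correct choice.
    if any(p.search(question) or p.search(gold_choice) for p in OPTION_REFERENCES):
        return False
    # Stage 2: the LLM judges whether the question alone gives enough context.
    return llm_is_self_contained(question)
</preformat>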
      <sec id="sec-9-3">
        <title>Question Choices</title>
        <p>"The Young Pope" è il titolo della 1) Kim Rossi Stuart 2) Christian De Sica 3) Roberto Benigni 4) Paolo
Sorserie ideata e diretta da: rentino
Con l’espressione "Schiafo di 1) Lo schiafo che Anagni diede a papa Bonifacio VIII 2) L’ofesa che Bonifacio</p>
      </sec>
      <sec id="sec-9-4">
        <title>Anagni" si è soliti indicare: VIII recò ad Anagni 3) L’oltraggio che subì papa Bonifacio VIII ad Anagni</title>
        <p>4)
Quale frase contiene un comple- 1) La ballerina aspettava con ansia il giorno del suo debutto 2) Sono andato
mento di compagnia? al lago con mia sorella per prendere il sole 3) Il medico garantisce che
con questa crema passerà il rossore 4) Con questa velocità non riuscirai mai a
finire il lavoro per domani
La frase "Sono felice" contiene: 1) un complemento oggetto 2) un complemento indiretto 3) un predicato
nominale D) un predicato verbale</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>