<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Curated Data does not mean Representative Data when training Large Language Models: an Experiment using Representative Data for Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>via Zamboni, 32, 40126, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>It is widely accepted in the literature that data curation is the first step for a successful pretraining of Large (and Small) Language Models (LLMs). Datasets generally fall into two categories: open datasets are publicly available, fostering transparency, reproducibility, and community-driven improvement, but they often face limitations in scale, diversity, and quality. Closed datasets, typically curated by private entities, can offer greater scale, higher quality, and proprietary data sources, yet they raise concerns around transparency, bias auditing, and public accountability. This paper presents an experiment aimed at quantitatively measuring the improvements provided by representative datasets for LLM pretraining. We pretrained two small LLMs under the same experimental conditions as the corresponding Italian reference models from the Minerva family, evaluated their performance on standard benchmarks, and used LLM-as-a-Judge to assess the Fluency, Coherence, and Relevance of generated texts on specific tasks. The results support the idea that, while open science and open datasets are important goals, representative corpora, even if closed, are more suitable for LLM pretraining, as they enable better performance under identical experimental conditions.</p>
      </abstract>
      <kwd-group>
<kwd>LLM pretraining</kwd>
        <kwd>representative corpora</kwd>
        <kwd>text generation evaluation</kwd>
        <kwd>LLM-as-a-judge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) have emerged as foundational tools in Natural Language Processing (NLP), powering a wide array of applications from question answering and summarisation to code generation and scientific discovery. Their performance, generalisation ability, and alignment with human values are deeply influenced by the quality, diversity, and scale of the data used during pretraining [1, 2]. As models grow larger and more capable, the need for rigorous data curation practices becomes increasingly critical, not only to enhance downstream performance but also to mitigate harmful biases, hallucinations, and environmental costs [3, 4].</p>
      <p>Data curation for LLMs involves the collection, filtering, deduplication, classification, and documentation of large-scale textual corpora. These processes aim to balance scale with quality by removing low-signal, harmful, or irrelevant content while preserving linguistic diversity and domain coverage [5, 6]. More recent efforts have highlighted that indiscriminate use of web-scale data may result in the propagation of social biases and misinformation [7], emphasising the importance of carefully designed curation pipelines that consider ethical and societal dimensions [8].</p>
      <p>While early work relied heavily on broad, minimally filtered internet scrapes (e.g., Common Crawl), more recent approaches have shifted toward structured, transparent, and task-specific datasets, often constructed through a combination of automated and manual filtering techniques [9]. These developments reflect a growing recognition that model capabilities and behaviours are closely tied to the provenance and properties of their training data. However, the field still lacks standardised methodologies and benchmarks for evaluating curated datasets, presenting challenges for reproducibility and comparative analysis.</p>
      <sec id="sec-1-1">
        <title>1.1. Open vs. Closed Pretraining Datasets</title>
        <p>The growing ecosystem of LLMs has revealed a sharp divide between open and closed approaches to data curation. On one hand, open-source initiatives such as BLOOM [10], OPT [2], Pythia [11] and Minerva [12] have committed to full transparency by using publicly available datasets and releasing detailed documentation of their training corpora. These efforts aim to promote reproducibility, community-driven auditing, and equitable access to foundation models. On the other hand, leading commercial models such as GPT-4, Claude and Gemini rely on proprietary or undisclosed datasets, raising questions about accountability, data provenance, and research reproducibility.</p>
        <p>The open-data approach is grounded in scientific ideals of transparency and collaborative validation. Models like BLOOM, trained exclusively on open-access sources including multilingual Common Crawl, Project Gutenberg, and academic corpora, exemplify an effort to democratise LLM research and foster global participation [10]. The open release of datasets enables systematic study of data quality, bias, duplication, and domain representation, and it supports the downstream development of safer and more equitable AI systems.</p>
        <p>In contrast, closed models often cite competitive, ethical, or legal reasons for withholding training data details. OpenAI's GPT-4 report, for example, states that "given the competitive landscape and the safety implications of large-scale models", they have opted not to disclose training data sources. While this protects proprietary advantages and potentially prevents misuse of harmful content, it also hinders external audits of data quality, bias, and copyright compliance. Without transparency, it becomes difficult to evaluate how model performance or behaviour may be influenced by specific sources or omissions.</p>
        <p>This divergence has implications for the broader AI research community. The lack of visibility into proprietary datasets exacerbates the reproducibility crisis in machine learning and limits efforts to assess the environmental and social impacts of training practices. Conversely, open models, while more transparent, often contend with limitations in data scope and quality due to the exclusion of copyrighted or paywalled content, potentially affecting their competitiveness in knowledge-rich domains.</p>
        <p>Ultimately, the tension between open and closed data paradigms reflects competing priorities in the development of foundation models: openness and accountability versus competitive advantage and scalability.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Key Open Datasets for LLM Pretraining</title>
        <p>A number of high-quality, publicly available datasets have become foundational to the training of open-source large language models. These datasets vary in terms of domain coverage, linguistic diversity, and preprocessing strategies, but collectively represent the backbone of transparent and reproducible LLM development.</p>
        <p>The Pile [5], a curated 825 GB dataset designed for training language models, combines diverse sources such as academic articles (arXiv), code (GitHub), books, legal documents, and forums to maximise domain coverage. C4 (Colossal Clean Crawled Corpus) [4] is a large-scale, filtered dataset derived from Common Crawl: it removes boilerplate, duplicates, and low-quality text to provide a clean, general-purpose corpus for language modelling. RedPajama [13] presents a reproducible, open alternative to the Llama pretraining dataset; it aggregates content from Common Crawl, Wikipedia, ArXiv, StackExchange, and more, with a focus on transparency and reproducibility. RefinedWeb [14] features a deduplicated and quality-filtered web dataset used to train models such as Falcon, emphasising a scalable yet high-signal alternative to raw web scrapes. CulturaX [15] is a large-scale multilingual web dataset covering 167 languages, designed to improve the cultural and linguistic diversity of LLMs; it emphasises the inclusion of underrepresented languages by sourcing and curating high-quality content from Wikipedia, government websites, and news sources. Books3 (from The Pile) is a large collection of digitised books, providing long-form narrative and expository text; despite its utility, its inclusion has sparked debate due to copyright concerns, underscoring the need for clearer data usage norms.</p>
        <p>These datasets are frequently combined or customised depending on the training goals, whether for general-purpose models, multilingual capability, or domain-specific LLMs. CulturaX, in particular, represents a growing movement toward linguistic equity and cultural inclusivity in large-scale model pretraining.</p>
        <p>The effort to create open datasets for LLM pretraining that cover a wide range of data inevitably encounters a major challenge: whether or not to include text types that are not freely available on the web. In our view, this is a critical issue when comparing LLMs trained on open data with their counterparts developed by large tech companies using closed datasets, which undoubtedly include a richer and more representative variety of document types for the language or languages being studied. The central concept here is representativeness, which Egbert et al. [16] define as "the extent to which a corpus permits accurate generalisations about the target domain, which involves two components: the extent to which the corpus includes the full range of both text types and linguistic distributions in a domain". In essence, a representative corpus should serve as a statistically valid sample of the population of texts corresponding to the language variety under investigation.</p>
        <p>Another point regards the quality of texts published on the Web when compared with the curated and edited texts issued by professional publishers. Web texts and published texts differ significantly in form, purpose, authorship, and audience engagement. Web texts, such as blog posts, social media updates, and news articles, tend to be dynamic, hyperlinked, and frequently updated; they emphasise immediacy, brevity, and interactivity, and are often written in an informal tone to encourage user engagement [17]. In contrast, published texts like academic articles, books, and journals are typically static, peer-reviewed, and follow rigorous editorial standards; these texts prioritise depth, permanence, and formal structure. Additionally, while published texts aim for scholarly credibility and longevity, Web texts often prioritise accessibility, shareability, and multimedia integration. Understanding these distinctions is critical for analysing digital literacy and communication strategies in the information age and, in our opinion, it is also critical for LLM pretraining, since it helps provide "good" and "reliable" texts for teaching a language to an LLM.</p>
        <p>This paper aims at exploring and quantifying the differences in training an LLM either on open Web data or on a representative corpus, examining whether the two settings produce differences in LLM performance, taking contemporary Italian as the reference language.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. A Representative Dataset</title>
      <sec id="sec-2-1">
        <title>Minerva is the first family of LLMs pretrained from</title>
        <p>
          Given the objective of this study, we introduce the refer- scratch on Italian [12] and emerged as a standard
reference corpus for contemporary Italian which we use as a ence for Italian NLP. A prior study pretrained an Italian
template for building the representative corpus employed model based on GPT-2 from scratch [
          <xref ref-type="bibr" rid="ref5">21</xref>
          ], but it used a
relin our experiments. atively small 117M-parameter set, making it not directly
comparable to modern LLMs or the more recent Minerva
family.
2.1. The CORIS Italian Corpus In order to perform a fair comparison with the
MinCORIS design was started in 1998 with the purpose of erva models, we adopted exactly the same pretraining
creating a representative, synchronic, general reference settings and hyperparameters described in [12]. We
precorpus of written contemporary Italian which would be trained the models using the MosaicML LLM-Foundry1
easily accessible and user-friendly [
          <xref ref-type="bibr" rid="ref3">18, 19</xref>
          ]. CORIS cur- package concentrating our eforts on two models: a
350Mrently contains 165 million words and has been updated parameter model trained on a single node equipped with
every three years by means of a monitor corpus [
          <xref ref-type="bibr" rid="ref4">20</xref>
          ]. It four A100-64GB GPUs for an equivalent number of steps
consists of a collection of authentic and commonly occur- as the Minerva-350M model and 1B parameter model
ring texts in electronic format chosen by virtue of their trained on 2 nodes in the same way as Minerva-1B2.
representativeness of contemporary Italian. While a 11.6 billion-token corpus is big enough for
pre
        </p>
        <p>
          After a long design process devoted to a careful defini- training a 350M model, it is too small, following the
Chintion of relevant textual macro-varieties and their propor- chilla rule [
          <xref ref-type="bibr" rid="ref6">22</xref>
          ] involving a parameter/token ratio of 1:20,
tions, CORIS has been structured as outlined in Table 1: for a 1B model, thus, in this second case, we could expect
the largest section, namely ‘Press’, contains newspapers some performance degradation.
and periodicals articles, ‘Fiction’ a collection of novels A detailed quality analysis of the Minerva dataset is
and short stories while scientific texts and legal/bureau- contained in the original paper [12].
cratic documents where included, respectively, in
‘Academic Prose’ and ‘L&amp;A Prose’. The last two sections 3.2. First Evaluation on Standard
contain respectively documents not belonging to the pre- Benchmarks
vious categories and texts belonging to Internet language
(mainly posts from high quality blogs).
        </p>
        <p>CORIS Section
Press
Fiction
Academic Prose
Legal &amp; Admin. Prose
Miscellanea
Ephemera</p>
        <p>Proportion
38%
25%
12%
10%
10%
5%</p>
      </sec>
      <sec id="sec-2-2">
        <title>Based on the general CORIS schema outlined in Table</title>
      </sec>
      <sec id="sec-2-3">
        <title>1, we created an 11.6 billion-token corpus that includes the same textual macro-varieties and maintains the same</title>
      </sec>
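        <p>As an illustration of this balancing step, the sketch below fills a fixed token budget so that each macro-variety reaches its Table 1 share. This is a minimal sketch: the document streams and the token counter are hypothetical placeholders, not our actual pipeline.</p>
        <preformat># Illustrative CORIS-style balancing: fill an 11.6B-token budget so that each
# macro-variety matches its Table 1 proportion. iter_documents() and
# count_tokens() are hypothetical stand-ins for the real data sources.

CORIS_PROPORTIONS = {
    "Press": 0.38, "Fiction": 0.25, "Academic Prose": 0.12,
    "Legal &amp; Admin. Prose": 0.10, "Miscellanea": 0.10, "Ephemera": 0.05,
}
TOTAL_BUDGET = 11_600_000_000  # tokens

def build_balanced_corpus(iter_documents, count_tokens):
    """iter_documents(section) yields raw texts; count_tokens(text) returns int."""
    corpus = []
    filled = {section: 0 for section in CORIS_PROPORTIONS}
    for section, share in CORIS_PROPORTIONS.items():
        quota = int(share * TOTAL_BUDGET)
        for doc in iter_documents(section):
            if filled[section] >= quota:
                break  # this macro-variety has reached its CORIS share
            corpus.append((section, doc))
            filled[section] += count_tokens(doc)
    return corpus, filled</preformat>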
      <sec id="sec-2-4">
        <title>The evaluation of LLMs has traditionally relied on a suite</title>
        <p>of standardised benchmarks designed to assess a broad
range of linguistic, reasoning, and task-specific
capabilities. These benchmarks enable systematic comparison
across models and facilitate progress tracking in natural
language processing.</p>
        <p>
          To address the need for evaluating generation-based
tasks, LAMBADA [
          <xref ref-type="bibr" rid="ref7">23</xref>
          ] tests a model’s ability to predict
the final word of a passage based on broad context,
emphasising long-range dependency modelling. In parallel,
benchmarks such as WinoGrande [
          <xref ref-type="bibr" rid="ref8">24</xref>
          ] and HellaSwag
[
          <xref ref-type="bibr" rid="ref9">25</xref>
          ] target common-sense reasoning and
disambigua
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>1https://github.com/mosaicml/llm-foundry</title>
      </sec>
      <sec id="sec-2-6">
        <title>2https://huggingface.co/sapienzanlp/Minerva-XXX-base-v1.0</title>
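        <p>To make the token-budget arithmetic explicit, the following minimal sketch, assuming only the 1:20 parameter/token ratio of [22], reproduces the figures above: 11.6 billion tokens comfortably cover a 350M-parameter model but fall short of the roughly 20 billion tokens a 1B-parameter model would require.</p>
        <preformat># Chinchilla-style token budget check for our two model sizes,
# assuming the 1:20 parameter/token ratio reported in [22].

CORPUS_TOKENS = 11.6e9  # size of the CORIS-balanced pretraining corpus

def chinchilla_optimal_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Compute-optimal number of training tokens for a given model size."""
    return n_params * ratio

for name, n_params in [("350M", 350e6), ("1B", 1e9)]:
    needed = chinchilla_optimal_tokens(n_params)
    status = "sufficient" if CORPUS_TOKENS >= needed else "undersized"
    print(f"{name}: needs ~{needed / 1e9:.0f}B tokens; the 11.6B corpus is {status}")

# Output:
# 350M: needs ~7B tokens; the 11.6B corpus is sufficient
# 1B: needs ~20B tokens; the 11.6B corpus is undersized</preformat>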
        <p>A detailed quality analysis of the Minerva dataset is contained in the original paper [12].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. First Evaluation on Standard Benchmarks</title>
        <p>The evaluation of LLMs has traditionally relied on a suite of standardised benchmarks designed to assess a broad range of linguistic, reasoning, and task-specific capabilities. These benchmarks enable systematic comparison across models and facilitate progress tracking in natural language processing.</p>
        <p>To address the need for evaluating generation-based tasks, LAMBADA [23] tests a model's ability to predict the final word of a passage based on broad context, emphasising long-range dependency modelling. In parallel, benchmarks such as WinoGrande [24] and HellaSwag [25] target common-sense reasoning and disambiguation, probing a model's depth of understanding beyond surface-level patterns.</p>
        <p>More recently, MMLU (Massive Multitask Language Understanding) has been introduced as a collaborative effort to assess a wide range of LLM competencies, ranging from law and medicine to physics and philosophy, offering a broad-spectrum evaluation across 57 subjects to test a model's ability to generalise across domains [26].</p>
        <p>While existing evaluation benchmarks are highly valuable, they are primarily designed to assess LLM performance in English and are therefore not suitable for our purposes. Recently, a group of Italian researchers introduced a promising new benchmark, called ITA-Bench, for evaluating LLMs in Italian. This suite combines automatically translated versions of popular English benchmarks with adapted, manually curated datasets for Italian [27]. We adopted ITA-Bench for the initial evaluation of our new LLMs and conducted a preliminary comparison with the equivalent Minerva models.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption>
            <p>ITA-Bench results (ARC-C, ARC-E, BoolQ, GSM8K, MMLU, PIQA, SciQ, TQA) for Minerva-350M-base-v1.0 and Minerva-1B-base-v1.0 (as reported in [12] and as re-evaluated by us) and for CORISllm-350M-base and CORISllm-1B-base.</p>
          </caption>
        </table-wrap>
        <p>Table 2 presents the results of CORISllm-350M and CORISllm-1B on ITA-Bench, alongside a comparison with the corresponding Minerva models. Overall, the two LLMs demonstrate comparable performance: Minerva performs better on certain tasks, while CORISllm slightly outperforms it on others.</p>
        <p>On average, the Minerva models show slightly better performance; however, these results must be interpreted in light of the nature of the benchmark. ITA-Bench focuses primarily on tasks involving commonsense reasoning and scientific knowledge retrieval, which are not well suited for assessing differences in text generation capabilities. Pretraining an LLM on a representative corpus does not inherently confer an advantage in reasoning or STEM-related tasks, because the dataset used for pretraining does not contain specific materials useful for increasing performance on such tasks, and no specific methods were used to promote the development of reasoning abilities. Accordingly, CORISllm and Minerva perform similarly on ITA-Bench. To properly evaluate our research hypothesis, a more targeted assessment of text generation abilities is required.</p>
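        <p>Benchmarks of this kind are typically scored by comparing the log-likelihood that the model assigns to each candidate answer. The sketch below shows this standard procedure with the HuggingFace transformers API; the checkpoint name follows the Minerva naming scheme and is used only for illustration.</p>
        <preformat># Standard log-likelihood scoring for multiple-choice benchmark items.
# The checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")
model = AutoModelForCausalLM.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")

def choice_logprob(context: str, choice: str) -> float:
    """Sum of the log-probabilities of the choice tokens given the context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # logits at position i predict token i+1, so shift the slice by one
    targets = full_ids[0, ctx_len:]
    return logprobs[0, ctx_len - 1 : -1].gather(-1, targets.unsqueeze(-1)).sum().item()

def predict(context, choices):
    """Return the index of the highest-scoring candidate answer."""
    return max(range(len(choices)), key=lambda i: choice_logprob(context, choices[i]))</preformat>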
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text Generation Quality Evaluation</title>
        <p>Evaluating LLM-generated texts is inherently challenging, and assessing the quality of these textual outputs is even more complex [28].</p>
        <p>Our primary objective is to conduct a careful evaluation of the quality of the texts generated by LLMs. Specifically, we aim to compare an LLM trained on "open" but non-representative datasets, namely the Minerva family, with one trained on a representative and balanced dataset, CORISllm. The comparison focuses on commonly used human evaluation metrics: Fluency, internal Coherence, and text Relevance to the given task.</p>
        <p>To ensure a fair evaluation, it is necessary to generate and assess a substantial number of texts. For this purpose, we adopted the LLM-as-a-Judge (LaaJ) approach, after comparing LLM annotations with human judgments. We designed six distinct prompts, each corresponding to one of the six CORIS macro-varieties: a short newspaper article, a children's fairy tale, an abstract of a scientific paper, a judgment for a crime, a trip description, and a brief movie review, and generated 50 outputs for each. Table 3 presents the prompts used to stimulate the LLMs to generate texts.</p>
        <p>The following sections first describe the human evaluation process, followed by the LaaJ methodology we employed to achieve our objective.</p>
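        <p>The generation step can be sketched as follows; the sampling parameters shown are illustrative placeholders rather than the exact settings we used.</p>
        <preformat># Sketch of the generation step: each of the six prompts is submitted 50 times
# to a model. Checkpoint name and sampling parameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")
model = AutoModelForCausalLM.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")

def generate_texts(prompt: str, n: int = 50, max_new_tokens: int = 512):
    """Sample n texts for one of the six macro-variety prompts."""
    ids = tok(prompt, return_tensors="pt").input_ids
    outputs = []
    for _ in range(n):
        with torch.no_grad():
            out = model.generate(ids, do_sample=True, top_p=0.9,
                                 temperature=0.8, max_new_tokens=max_new_tokens)
        # keep only the newly generated continuation
        outputs.append(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    return outputs</preformat>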
        <sec id="sec-3-3-1">
          <title>3.3.1. Human Evaluation of LLM Outputs</title>
          <p>Human evaluation remains the gold standard for assessing the quality of natural language outputs produced by LLMs. Despite the growing sophistication of automated metrics and model-based evaluators, human judgments are uniquely capable of capturing nuanced dimensions of quality such as contextual appropriateness, subtle coherence errors, pragmatic relevance, and factual accuracy. Consequently, human assessments are widely used both in benchmarking LLMs and in validating automatic evaluation methods.</p>
          <p>Human evaluation of LLM outputs is typically carried out using rating scales (e.g., Likert scales), pairwise comparisons, or ranking protocols. Each approach has strengths and limitations: scalar ratings allow fine-grained feedback but may suffer from rater calibration issues, while relative comparisons often yield more consistent judgments.</p>
          <p>In the context of LLM outputs, common evaluation criteria include fluency, coherence, relevance, factual accuracy, and harmlessness or bias. For instance, the HELM benchmark [29] employs extensive human annotation pipelines to assess these aspects. Fluency is often reliably judged, but tasks like evaluating factual consistency or detecting hallucinations present greater challenges. Human annotators are also crucial for detecting subtle harms, such as stereotyping or toxicity, which automated tools frequently miss or misclassify [3].</p>
          <p>Despite its value, human evaluation has notable limitations. It is expensive, time-consuming, and subject to inter-rater variability, which can obscure subtle differences between systems. Additionally, annotator background and task framing can influence outcomes. For example, work has shown that crowdworker evaluations can differ systematically from domain-expert judgments, particularly on complex tasks like summarisation or question answering [30].</p>
          <p>To compare the behaviour of the considered LaaJ systems with human judgments, we conducted a small experiment in which three expert linguists manually evaluated 120 texts produced by Minerva-350M in response to the six prompts given in Table 3. The annotators were asked to evaluate the LLM-generated texts according to the three selected metrics; the instructions given to them were almost identical to the prompts in Tables 8, 9 and 10 that we used for LaaJ. Table 4 (top-left section) shows the Spearman Rank Correlation Coefficients (SRCC) between the rankings provided by the three human annotators A1-A3, who assigned scores on a 5-point Likert scale. The correlations were relatively low, highlighting the challenges human annotators face in consistently grading text production using similar criteria.</p>
          <table-wrap id="tab-4">
            <label>Table 4</label>
            <caption>
              <p>SRCC between the Fluency (Flu.), Coherence (Coh.), and Relevance (Rel.) rankings of the human annotators A1-A3 and of the two LaaJ systems.</p>
            </caption>
          </table-wrap>
          <p>Due to the low correlations observed, particularly in the assessment of Fluency, we decided against using the human annotations to calibrate our LaaJ systems and chose to rely solely on the LaaJ methodology for the evaluations.</p>
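          <p>The agreement figures in Table 4 are plain Spearman rank correlations over paired scores; a minimal sketch, with invented example scores, is shown below.</p>
          <preformat># SRCC between two annotators' 5-point Likert scores over the same items.
# The score vectors here are invented examples, not our annotation data.
from scipy.stats import spearmanr

a1_fluency = [4, 3, 5, 2, 4, 3]  # annotator A1
a2_fluency = [3, 3, 4, 2, 5, 2]  # annotator A2

rho, p_value = spearmanr(a1_fluency, a2_fluency)
print(f"SRCC = {rho:.2f} (p = {p_value:.3f})")</preformat>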
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. LLMs as Automated Judges of Text Quality</title>
          <p>Recent advances in LLMs have opened new avenues for evaluating textual outputs in NLP. Traditionally, the evaluation of text generation has relied heavily on human judgments, which, while high in fidelity, are costly, time-consuming, and often inconsistent due to inter-annotator variability [31]. In contrast, LLMs such as GPT-3/4, PaLM and Gemini have demonstrated potential not only in generating text but also in providing reliable meta-judgments about language quality, including fluency, coherence, and relevance.</p>
          <p>Several studies have investigated the reliability of LLMs as automatic evaluators. For instance, G-Eval [32] highlights that LLMs can approximate human judgments in multi-dimensional evaluation tasks when properly prompted. As shown in the review by Li et al. [33], it is possible to set up a framework where an LLM acts as a zero-shot or few-shot judge, providing ordinal or scalar ratings that correlate highly with human annotations. This correlation is particularly strong when the models are instructed explicitly to focus on specific dimensions of quality, such as grammatical fluency or semantic relevance.</p>
          <p>In terms of Fluency, LLMs have internalised extensive grammatical structures through pretraining on large corpora, enabling them to effectively recognise and assess grammaticality and naturalness. For Coherence, models evaluate the logical consistency and flow of ideas across sentences or turns, especially when equipped with context windows that span multiple paragraphs. Evaluating Relevance, the alignment of a response to a prompt or topic, has also been shown to benefit from LLMs' contextual awareness and knowledge grounding.</p>
          <p>In summary, LLMs have emerged as credible tools for evaluating textual quality across multiple dimensions: when applied with careful prompt design and interpretative caution, they can serve as scalable, cost-effective complements to human assessment.</p>
          <p>In order to avoid any inconsistency introduced by human judgments, we decided to rely only on two different LLMs for evaluating the quality of the texts produced by the CORISllm and Minerva models. We adopted a powerful online LLM, namely Gemini-2.0-flash through the Google APIs, and an offline, quantised model, namely bartowski/Llama-3.3-70B-Instruct-Q6_K_L, downloaded from the Huggingface repository.</p>
      <sec id="sec-2-14">
        <title>2.0-flash through Google APIs, and an ofline, quantised</title>
        <p>model, namely bartowski/Llama-3.3-70B-Instruct-Q6_K_L
downloaded from the Huggingface repository3. 4. Results</p>
        <p>Tables 8, 9 and 10 show the three prompts we have
designed for asking the two LaaJ to evaluate, using a Tables 6 and 7 present the means and standard deviations
5-point Likert scale, Fluency, Coherence and Relevance of of the scores assigned by the two judges across the three
the texts generated by CORISllm-350M/1B and Minerva- evaluation metrics to the 600 texts forming the evaluation
350M/1B. For designing these prompt we took inspiration dataset. The tables also display the results of a t-test
from similar prompts proposed in G-Eval [32]. The sepa- for independent samples, which assesses the statistical
rators ‘##SYSTEM##’, ‘##USER##’ and ’##ASSISTANT##’ significance of the diferences in means.
for marking the three diferent blocks of information in Examining the evaluations provided by Gemini, we
the prompts were replaced with empty lines for Gemini observe a notable increase in the scores assigned to
prompts and with the appropriate separators for prompts CORISllm-350M compared to the equivalent
Minervaproposed to the Llama judge. 350M model. Furthermore, the scores for it are so high</p>
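          <p>A single LaaJ call can be sketched as follows. The snippet assumes the google-generativeai Python package for the Gemini judge and a simple regular expression to extract the Likert score; both are illustrative choices, not a description of our exact implementation.</p>
          <preformat># Sketch of one judge call: build the prompt from the blocks of Tables 8-10
# and parse a 1-5 score from the reply. Illustrative, not our exact pipeline.
import re
import google.generativeai as genai

genai.configure(api_key="...")  # API key elided
judge = genai.GenerativeModel("gemini-2.0-flash")

def judge_score(system_block: str, exercise: str, generated_text: str) -> int:
    # For Gemini, the ##SYSTEM##/##USER##/##ASSISTANT## separators are
    # replaced by empty lines, as described above.
    prompt = f"{system_block}\n\n{exercise}\n\n{generated_text}"
    reply = judge.generate_content(prompt).text
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return int(match.group())</preformat>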
          <p>To assess the reliability of their judgments, we first evaluated the agreement between the two LaaJ systems and the human annotators. Table 4 also reports the SRCC between each LaaJ and the human annotators. While the two LaaJ systems show a high mutual correlation, their agreement with individual human annotators is lower, though still comparable to the level of agreement observed between the human annotators themselves. This further supports the case for favouring LaaJ-generated annotations over those produced by humans.</p>
          <p>Table 5 shows the SRCCs between the Gemini-2.0-flash and the quantised Llama-3.3-70B judges when evaluating 600 new texts produced by the CORISllm and Minerva models (300 for each model). The correlations are all quite high and highly significant; we can therefore reliably use these automatic judges for evaluating the textual production of the tested models.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Tables 6 and 7 present the means and standard deviations of the scores assigned by the two judges, across the three evaluation metrics, to the 600 texts forming the evaluation dataset. The tables also display the results of a t-test for independent samples, which assesses the statistical significance of the differences in means.</p>
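      <p>The significance test is a standard two-sample t-test over the 300 per-model scores for each metric. The sketch below uses synthetic score vectors: the CORISllm mean and standard deviation are taken from Table 6, while the Minerva values are invented stand-ins.</p>
      <preformat># Two-sample t-test on per-model judge scores for one metric.
# Synthetic data: the Minerva parameters are invented stand-ins.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
corisllm_fluency = rng.normal(2.74, 0.96, 300)  # mean/std from Table 6
minerva_fluency = rng.normal(2.50, 0.95, 300)   # illustrative values

t_stat, p_value = ttest_ind(corisllm_fluency, minerva_fluency)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant entries get an asterisk</preformat>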
      <p>Examining the evaluations provided by Gemini, we observe a notable increase in the scores assigned to CORISllm-350M compared to the equivalent Minerva-350M model. Furthermore, its scores are so high that they are comparable, and not significantly different, to those of the larger Minerva-1B model, with the exception of the Relevance metric. With regard to CORISllm-1B, it performs much better than Minerva-350M, as expected, and more or less on par with Minerva-1B, exhibiting better performance on Fluency and worse on Relevance. All differences that are statistically significant are indicated by the asterisks next to the metric.</p>
      <p>Regarding the Llama-3.3-70B judge, CORISllm-350M consistently receives significantly higher scores than the equivalent Minerva-350M model, and its scores are comparable to those of the Minerva-1B model across all metrics. Using this judge, CORISllm-1B performs much better than both Minerva models in a highly significant way.</p>
      <table-wrap id="tab-6-7">
        <label>Tables 6 and 7</label>
        <caption>
          <p>Means and standard deviations of the judges' scores for Fluency (Flu), Coherence (Coh), and Relevance (Rel), with statistically significant differences marked by asterisks (partial; recoverable entries: CORISllm-350M Flu=2.74±0.96, Coh=2.01±0.80, Rel=1.97±1.39; CORISllm-1B Flu=3.1±0.98, Coh=2.18±0.92, Rel=1.95±1.35).</p>
        </caption>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>5. Discussion &amp; Conclusions</title>
      <sec id="sec-3-1">
        <title>When evaluating the textual production of equiva</title>
        <p>In this study, we examined how the choice of data for lent models across Fluency, internal Coherence, and
RelLLM pretraining afects performance, emphasizing the evance to the assigned task, CORISllm outperformed
importance of using a representative corpus to enhance Minerva. Due to the limited dimensions of the training
the quality of text produced by generative LLMs. corpus, suitable to pretrain 350M models and less 1B
mod</p>
        <p>Using the design framework of the CORIS corpus, a els, the results are more neat on smaller models. In any
representative corpus of contemporary Italian, we pre- case, this points in the direction that using representative
trained two LLMs following exactly the same process and balanced corpora for LLM pretraining has an impact
used for the Minerva models [12]. However, instead of on performance. In our experiments, CORISllm-350M,
the original dataset, we used a new 11.6 billion-token rep- despite having only one-third of the model parameters,
resentative corpus specifically structured to align with performed nearly on par with Minerva-1B in terms of
the CORIS macrovarieties. generative text quality.</p>
        <p>##SYSTEM##
Tu sei un linguista esperto nella valutazione dei testi. Ti
verrà fornita la descrizione di un esercizio e lo svolgimento
di questo esercizio da parte di un’AI.</p>
        <p>Il tuo compito è valutare lo svolgimento in base a una metrica.</p>
        <p>Assicurati di leggere e comprendere attentamente queste
istruzioni. Tieni aperto questo documento durante la
revisione e consultalo quando necessario.</p>
        <p>Criteri di valutazione:
Coerenza (1-5): la qualità globale di tutte le frasi. Il testo
dovrebbe essere ben strutturato e ben organizzato. Il testo
non dovrebbe contenere solo un mucchio di informazioni
correlate, ma dovrebbe svilupparsi da una frase a un corpo
coerente di informazioni su un argomento.</p>
        <p>Fasi di valutazione:
1. Leggi attentamente lo svolgimento e identifica
l’argomento principale e i punti chiave. 2. Analizza il
contenuto di ogni frase e valuta se frasi successive sono legate
logicamente e strutturalmente. 3. Assegna un punteggio
per la coerenza su una scala da 1 a 5, dove 1 è il punteggio
più basso e 5 il punteggio più alto in base ai Criteri di
valutazione.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The goal of this work was not to create a complete</title>
        <p>family of LLMs pretrained on representative corpora and
ready for production deployment. Rather, we aimed to
provide a proof-of-concept study that emphasises the
need for greater attention to training corpora in order to
develop better models.</p>
        <p>While the openness of training data is certainly a
valuable principle, the results presented here suggest that it is
equally important to incorporate high-quality published
texts into the training process in order to enhance
performance without altering the transformer model. Since
such materials are often protected by copyright, it is
essential to establish specific agreements with publishers.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Due to copyright restrictions on portions of our pre</title>
        <p>training corpus, we are unable to distribute it freely.</p>
      </sec>
      <sec id="sec-3-4">
        <title>CORISllm models are available upon request.</title>
        <p>##SYSTEM##
Tu sei un linguista esperto nella valutazione dei testi. Ti
verrà fornita la descrizione di un esercizio e lo svolgimento
di questo esercizio da parte di un’AI.</p>
        <p>Il tuo compito è valutare lo svolgimento in base a una metrica.
Assicurati di leggere e comprendere attentamente queste
istruzioni. Tieni aperto questo documento durante la
revisione e consultalo quando necessario.</p>
        <p>Criteri di valutazione:
Rilevanza (1-5): Lo svolgimento deve includere solo
informazioni allineate con la descrizione dell’esercizio. Dovrai
penalizzare gli svolgimenti che contengono informazioni o
argomenti non rilevanti rispetto alla descrizione.
Fasi di valutazione:
1. Leggi attentamente lo svolgimento e identifica
l’argomento principale e i punti chiave. 2. Confronta lo
svolgimento con la descrizione dell’esercizio. 3. Assegna un
punteggio di rilevanza da 1 a 5, dove 1 è il punteggio più
basso e 5 il punteggio più alto in base ai Criteri di valutazione.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We acknowledge the CINECA award no. HP10CPJEG0 (project DARE4LLM) under the ISCRA initiative, for the availability of HPC resources and support.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '
          <volume>20</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Roller,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dewan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mipp</surname>
          </string-name>
          .
          <volume>189</volume>
          -
          <fpage>197</fpage>
          . ceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>599</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R. Rossini</given-names>
            <surname>Favretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          , C. De Santis, [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , CORIS/CODIS:
          <article-title>A corpus of written Italian based H</article-title>
          .
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Y. Zhang,</given-names>
          </string-name>
          <article-title>on a defined and a dynamic model</article-title>
          , in: A. Wilson,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , A survey on P. Rayson,
          <string-name>
            <surname>T.</surname>
          </string-name>
          McEnery (Eds.),
          <article-title>A Rainbow of Cor- evaluation of large language models</article-title>
          ,
          <source>ACM Trans. pora: Corpus Linguistics and the Languages of the Intell. Syst. Technol</source>
          .
          <volume>15</volume>
          (
          <year>2024</year>
          ). World, Lincom-Europa, Munich,
          <year>2002</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>38</lpage>
          . [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sinclair</surname>
          </string-name>
          , Corpus, Concordance, Collocation, Ox- et al.,
          <article-title>Holistic evaluation of language models</article-title>
          , Transford University Press,
          <year>1991</year>
          .
          <source>actions on Machine Learning Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          , [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karpinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Akoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>The perils of M. Guerini, Geppetto carves italian into a language using Mechanical Turk to evaluate open-ended text model</article-title>
          ,
          <source>in: Proceedings of the 7th Italian Conference generation, in: Proceedings of the 2021 Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2020</year>
          ),
          <source>CEUR on Empirical Methods in Natural Language ProWorkshop Proceedings</source>
          , Bologna, Italy,
          <year>2020</year>
          . cessing, Association for Computational Linguistics,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Buchatskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de Las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hennigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>van den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Damoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <article-title>Training compute-optimal large language models</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22</source>
          , Curran Associates Inc., Red Hook, NY, USA,
          <year>2022</year>
          . [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>van der Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>van Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krahmer</surname>
          </string-name>
          ,
          <article-title>Human evaluation of automatically generated text: Current trends and best practice guidelines</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>67</volume>
          (
          <year>2021</year>
          ) 101151. [32]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>G-Eval: NLG evaluation using GPT-4 with better human alignment</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>2511</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Paperno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kruszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boleda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>The LAMBADA dataset: Word prediction requiring a broad discourse context</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Erk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Berlin, Germany,
          <year>2016</year>
          , pp.
          <fpage>1525</fpage>
          -
          <lpage>1534</lpage>
          . [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods</article-title>
          ,
          <year>2024</year>
          . arXiv:2412.05579.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Le Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>WinoGrande: An adversarial Winograd schema challenge at scale</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8732</fpage>
          -
          <lpage>8740</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>HellaSwag: Can a machine really finish your sentence?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>4791</fpage>
          -
          <lpage>4800</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Measuring massive multitask language understanding</article-title>
          ,
          <source>in: Proceedings of the International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Itabench: Towards a more comprehensive evaluation for Italian LLMs</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <source>CEUR Workshop Proceedings</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>