<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Curated Data does not mean Representative Data when training Large Language Models: an Experiment using Representative Data for Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <addr-line>via Zamboni, 32, 40126, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>It is widely accepted in the literature that data curation is the first step for a successful pretraining of Large (and Small) Language Models (LLMs). Datasets generally fall into two categories: open datasets are publicly available, fostering transparency, reproducibility, and community-driven improvement, but they often face limitations in scale, diversity, and quality. Closed datasets, typically curated by private entities, can offer greater scale, higher quality, and proprietary data sources, yet they raise concerns around transparency, bias auditing, and public accountability. This paper presents an experiment aimed at quantitatively measuring the improvements provided by representative datasets for LLM pretraining. We pretrained two small LLMs under the same experimental conditions as the corresponding Italian reference models from the Minerva family, evaluated their performance on standard benchmarks, and used LLM-as-a-Judge to assess the Fluency, Coherence, and Relevance of generated texts on specific tasks. The results support the idea that, while open science and open datasets are important goals, representative corpora, even if closed, are more suitable for LLM pretraining, as they enable better performance under identical experimental conditions.</p>
      </abstract>
      <kwd-group>
<kwd>LLM pretraining</kwd>
        <kwd>representative corpora</kwd>
        <kwd>text generation evaluation</kwd>
        <kwd>LLM-as-a-judge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) have emerged as foundational tools in Natural Language Processing (NLP), powering a wide array of applications from question answering and summarisation to code generation and scientific discovery. Their performance, generalisation ability, and alignment with human values are deeply influenced by the quality, diversity, and scale of the data used during pretraining [1, 2]. As models grow larger and more capable, the need for rigorous data curation practices becomes increasingly critical, not only to enhance downstream performance but also to mitigate harmful biases, hallucinations, and environmental costs [3, 4].</p>
      <p>Data curation for LLMs involves the collection, filtering, deduplication, classification, and documentation of large-scale textual corpora. These processes aim to balance scale with quality by removing low-signal, harmful, or irrelevant content while preserving linguistic diversity and domain coverage [5, 6]. More recent efforts have highlighted that indiscriminate use of web-scale data may result in the propagation of social biases and misinformation [7], emphasising the importance of carefully designed curation pipelines that consider ethical and societal dimensions [8].</p>
      <p>While early work relied heavily on broad, minimally filtered internet scrapes (e.g., Common Crawl), more recent approaches have shifted toward structured, transparent, and task-specific datasets, often constructed through a combination of automated and manual filtering techniques [9]. These developments reflect a growing recognition that model capabilities and behaviours are closely tied to the provenance and properties of their training data. However, the field still lacks standardised methodologies and benchmarks for evaluating curated datasets, presenting challenges for reproducibility and comparative analysis.</p>
      <sec id="sec-1-1">
        <title>1.1. Open vs. Closed Pretraining Datasets</title>
        <p>The growing ecosystem of LLMs has revealed a sharp divide between open and closed approaches to data curation. On one hand, open-source initiatives such as BLOOM [10], OPT [2], Pythia [11] and Minerva [12] have committed to full transparency by using publicly available datasets and releasing detailed documentation of their training corpora. These efforts aim to promote reproducibility, community-driven auditing, and equitable access to foundation models. On the other hand, leading commercial models such as GPT-4, Claude and Gemini rely on proprietary or undisclosed datasets, raising questions about accountability, data provenance, and research reproducibility.</p>
        <p>The open-data approach is grounded in scientific ideals of transparency and collaborative validation. Models like BLOOM, trained exclusively on open-access sources including multilingual Common Crawl, Project Gutenberg, and academic corpora, exemplify an effort to democratise LLM research and foster global participation [10]. The open release of datasets enables systematic study of data quality, bias, duplication, and domain representation, and it supports the downstream development of safer and more equitable AI systems.</p>
        <p>In contrast, closed models often cite competitive, ethical, or legal reasons for withholding training data details. OpenAI's GPT-4 report, for example, states that "given the competitive landscape and the safety implications of large-scale models", they have opted not to disclose training data sources. While this protects proprietary advantages and potentially prevents misuse of harmful content, it also hinders external audits of data quality, bias, and copyright compliance. Without transparency, it becomes difficult to evaluate how model performance or behaviour may be influenced by specific sources or omissions.</p>
        <p>This divergence has implications for the broader AI research community. The lack of visibility into proprietary datasets exacerbates the reproducibility crisis in machine learning and limits efforts to assess the environmental and social impacts of training practices. Conversely, open models, while more transparent, often contend with limitations in data scope and quality due to the exclusion of copyrighted or paywalled content, potentially affecting their competitiveness in knowledge-rich domains.</p>
        <p>Ultimately, the tension between open and closed data paradigms reflects competing priorities in the development of foundation models: openness and accountability versus competitive advantage and scalability.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Key Open Datasets for LLM Pretraining</title>
        <p>A number of high-quality, publicly available datasets have become foundational to the training of open-source large language models. These datasets vary in terms of domain coverage, linguistic diversity, and preprocessing strategies, but collectively represent the backbone of transparent and reproducible LLM development.</p>
        <p>The Pile [5], a curated 825 GB dataset designed for training language models, combines diverse sources such as academic articles (arXiv), code (GitHub), books, legal documents, and forums to maximise domain coverage. C4 (Colossal Clean Crawled Corpus) [4] is a large-scale, filtered dataset derived from Common Crawl: it removes boilerplate, duplicates, and low-quality text to provide a clean, general-purpose corpus for language modelling. RedPajama [13] presents a reproducible, open alternative to the Llama pretraining dataset; it aggregates content from Common Crawl, Wikipedia, ArXiv, StackExchange, and more, with a focus on transparency and reproducibility. RefinedWeb [14] features a deduplicated and quality-filtered web dataset used to train models such as Falcon, emphasising a scalable yet high-signal alternative to raw web scrapes. CulturaX [15] is a large-scale multilingual web dataset covering 167 languages, designed to improve the cultural and linguistic diversity of LLMs; it emphasises the inclusion of underrepresented languages by sourcing and curating high-quality content from Wikipedia, government websites, and news sources. Books3 (from The Pile) is a large collection of digitised books, providing long-form narrative and expository text; despite its utility, its inclusion has sparked debate due to copyright concerns, underscoring the need for clearer data usage norms.</p>
        <p>These datasets are frequently combined or customised depending on the training goals, whether for general-purpose models, multilingual capability, or domain-specific LLMs. CulturaX, in particular, represents a growing movement toward linguistic equity and cultural inclusivity in large-scale model pretraining.</p>
        <p>The effort to create open datasets for LLM pretraining that cover a wide range of data inevitably encounters a major challenge: whether or not to include text types that are not freely available on the web. In our view, this is a critical issue when comparing LLMs trained on open data with their counterparts developed by large tech companies using closed datasets, which undoubtedly include a richer and more representative variety of document types for the language or languages being studied. The central concept here is representativeness, which Egbert et al. [16] define as "the extent to which a corpus permits accurate generalisations about the target domain, which involves two components: the extent to which the corpus includes the full range of both text types and linguistic distributions in a domain". In essence, a representative corpus should serve as a statistically valid sample of the population of texts corresponding to the language variety under investigation.</p>
        <p>Another point regards the quality of texts published on the Web when compared with the curated and edited texts issued by professional publishers. Web texts and published texts differ significantly in form, purpose, authorship, and audience engagement. Web texts, such as blog posts, social media updates, and news articles, tend to be dynamic, hyperlinked, and frequently updated; they emphasise immediacy, brevity, and interactivity, and are often written in an informal tone to encourage user engagement [17]. In contrast, published texts like academic articles, books, and journals are typically static, peer-reviewed, and follow rigorous editorial standards; these texts prioritise depth, permanence, and formal structure. Additionally, while published texts aim for scholarly credibility and longevity, Web texts often prioritise accessibility, shareability, and multimedia integration. Understanding these distinctions is critical for analysing digital literacy and communication strategies in the information age and, in our opinion, it is also critical for LLM pretraining, since it helps provide "good" and "reliable" texts for teaching a language to an LLM.</p>
        <p>This paper aims at exploring and quantifying the differences in training an LLM either on open Web data or on a representative corpus, examining whether the two settings produce differences in LLM performance, taking contemporary Italian as the reference language.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. A Representative Dataset</title>
      <sec id="sec-2-1">
        <title>Minerva is the first family of LLMs pretrained from</title>
        <p>
          Given the objective of this study, we introduce the refer- scratch on Italian [12] and emerged as a standard
reference corpus for contemporary Italian which we use as a ence for Italian NLP. A prior study pretrained an Italian
template for building the representative corpus employed model based on GPT-2 from scratch [
          <xref ref-type="bibr" rid="ref5">21</xref>
          ], but it used a
relin our experiments. atively small 117M-parameter set, making it not directly
comparable to modern LLMs or the more recent Minerva
family.
2.1. The CORIS Italian Corpus In order to perform a fair comparison with the
MinCORIS design was started in 1998 with the purpose of erva models, we adopted exactly the same pretraining
creating a representative, synchronic, general reference settings and hyperparameters described in [12]. We
precorpus of written contemporary Italian which would be trained the models using the MosaicML LLM-Foundry1
easily accessible and user-friendly [
          <xref ref-type="bibr" rid="ref3">18, 19</xref>
          ]. CORIS cur- package concentrating our eforts on two models: a
350Mrently contains 165 million words and has been updated parameter model trained on a single node equipped with
every three years by means of a monitor corpus [
          <xref ref-type="bibr" rid="ref4">20</xref>
          ]. It four A100-64GB GPUs for an equivalent number of steps
consists of a collection of authentic and commonly occur- as the Minerva-350M model and 1B parameter model
ring texts in electronic format chosen by virtue of their trained on 2 nodes in the same way as Minerva-1B2.
representativeness of contemporary Italian. While a 11.6 billion-token corpus is big enough for
pre
        </p>
        <p>
          After a long design process devoted to a careful defini- training a 350M model, it is too small, following the
Chintion of relevant textual macro-varieties and their propor- chilla rule [
          <xref ref-type="bibr" rid="ref6">22</xref>
          ] involving a parameter/token ratio of 1:20,
tions, CORIS has been structured as outlined in Table 1: for a 1B model, thus, in this second case, we could expect
the largest section, namely ‘Press’, contains newspapers some performance degradation.
and periodicals articles, ‘Fiction’ a collection of novels A detailed quality analysis of the Minerva dataset is
and short stories while scientific texts and legal/bureau- contained in the original paper [12].
cratic documents where included, respectively, in
‘Academic Prose’ and ‘L&amp;A Prose’. The last two sections 3.2. First Evaluation on Standard
contain respectively documents not belonging to the pre- Benchmarks
vious categories and texts belonging to Internet language
(mainly posts from high quality blogs).
        </p>
        <p>CORIS Section
Press
Fiction
Academic Prose
Legal &amp; Admin. Prose
Miscellanea
Ephemera</p>
        <p>Proportion
38%
25%
12%
10%
10%
5%</p>
      </sec>
      <sec id="sec-2-2">
        <title>Based on the general CORIS schema outlined in Table</title>
      </sec>
      <sec id="sec-2-3">
        <title>1, we created an 11.6 billion-token corpus that includes the same textual macro-varieties and maintains the same</title>
      </sec>
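        <p>As an illustration of this balancing step, the sketch below fills a fixed token budget so that each macro-variety reaches its Table 1 share. This is a minimal sketch: the document streams and the token counter are hypothetical placeholders, not our actual pipeline.</p>
        <preformat># Illustrative CORIS-style balancing: fill an 11.6B-token budget so that each
# macro-variety matches its Table 1 proportion. iter_documents() and
# count_tokens() are hypothetical stand-ins for the real data sources.

CORIS_PROPORTIONS = {
    "Press": 0.38, "Fiction": 0.25, "Academic Prose": 0.12,
    "Legal &amp; Admin. Prose": 0.10, "Miscellanea": 0.10, "Ephemera": 0.05,
}
TOTAL_BUDGET = 11_600_000_000  # tokens

def build_balanced_corpus(iter_documents, count_tokens):
    """iter_documents(section) yields raw texts; count_tokens(text) returns int."""
    corpus = []
    filled = {section: 0 for section in CORIS_PROPORTIONS}
    for section, share in CORIS_PROPORTIONS.items():
        quota = int(share * TOTAL_BUDGET)
        for doc in iter_documents(section):
            if filled[section] >= quota:
                break  # this macro-variety has reached its CORIS share
            corpus.append((section, doc))
            filled[section] += count_tokens(doc)
    return corpus, filled</preformat>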
      <sec id="sec-2-4">
        <title>The evaluation of LLMs has traditionally relied on a suite</title>
        <p>of standardised benchmarks designed to assess a broad
range of linguistic, reasoning, and task-specific
capabilities. These benchmarks enable systematic comparison
across models and facilitate progress tracking in natural
language processing.</p>
        <p>
          To address the need for evaluating generation-based
tasks, LAMBADA [
          <xref ref-type="bibr" rid="ref7">23</xref>
          ] tests a model’s ability to predict
the final word of a passage based on broad context,
emphasising long-range dependency modelling. In parallel,
benchmarks such as WinoGrande [
          <xref ref-type="bibr" rid="ref8">24</xref>
          ] and HellaSwag
[
          <xref ref-type="bibr" rid="ref9">25</xref>
          ] target common-sense reasoning and
disambigua
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>1https://github.com/mosaicml/llm-foundry</title>
      </sec>
      <sec id="sec-2-6">
        <title>2https://huggingface.co/sapienzanlp/Minerva-XXX-base-v1.0</title>
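        <p>To make the token-budget arithmetic explicit, the following minimal sketch, assuming only the 1:20 parameter/token ratio of [22], reproduces the figures above: 11.6 billion tokens comfortably cover a 350M-parameter model but fall short of the roughly 20 billion tokens a 1B-parameter model would require.</p>
        <preformat># Chinchilla-style token budget check for our two model sizes,
# assuming the 1:20 parameter/token ratio reported in [22].

CORPUS_TOKENS = 11.6e9  # size of the CORIS-balanced pretraining corpus

def chinchilla_optimal_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Compute-optimal number of training tokens for a given model size."""
    return n_params * ratio

for name, n_params in [("350M", 350e6), ("1B", 1e9)]:
    needed = chinchilla_optimal_tokens(n_params)
    status = "sufficient" if CORPUS_TOKENS >= needed else "undersized"
    print(f"{name}: needs ~{needed / 1e9:.0f}B tokens; the 11.6B corpus is {status}")

# Output:
# 350M: needs ~7B tokens; the 11.6B corpus is sufficient
# 1B: needs ~20B tokens; the 11.6B corpus is undersized</preformat>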
        <p>A detailed quality analysis of the Minerva dataset is contained in the original paper [12].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. First Evaluation on Standard Benchmarks</title>
        <p>The evaluation of LLMs has traditionally relied on a suite of standardised benchmarks designed to assess a broad range of linguistic, reasoning, and task-specific capabilities. These benchmarks enable systematic comparison across models and facilitate progress tracking in natural language processing.</p>
        <p>To address the need for evaluating generation-based tasks, LAMBADA [23] tests a model's ability to predict the final word of a passage based on broad context, emphasising long-range dependency modelling. In parallel, benchmarks such as WinoGrande [24] and HellaSwag [25] target common-sense reasoning and disambiguation, probing a model's depth of understanding beyond surface-level patterns.</p>
        <p>More recently, MMLU (Massive Multitask Language Understanding) has been introduced as a collaborative effort to assess a wide range of LLM competencies, ranging from law and medicine to physics and philosophy, offering a broad-spectrum evaluation across 57 subjects to test a model's ability to generalise across domains [26].</p>
        <p>While existing evaluation benchmarks are highly valuable, they are primarily designed to assess LLM performance in English and are therefore not suitable for our purposes. Recently, a group of Italian researchers introduced a promising new benchmark, called ITA-Bench, for evaluating LLMs in Italian. This suite combines automatically translated versions of popular English benchmarks with adapted, manually curated datasets for Italian [27]. We adopted ITA-Bench for the initial evaluation of our new LLMs and conducted a preliminary comparison with the equivalent Minerva models.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption>
            <p>ITA-Bench results (ARC-C, ARC-E, BoolQ, GSM8K, MMLU, PIQA, SciQ, TQA) for Minerva-350M-base-v1.0 and Minerva-1B-base-v1.0 (as reported in [12] and as re-evaluated by us) and for CORISllm-350M-base and CORISllm-1B-base.</p>
          </caption>
        </table-wrap>
        <p>Table 2 presents the results of CORISllm-350M and CORISllm-1B on ITA-Bench, alongside a comparison with the corresponding Minerva models. Overall, the two LLMs demonstrate comparable performance: Minerva performs better on certain tasks, while CORISllm slightly outperforms it on others.</p>
        <p>On average, the Minerva models show slightly better performance; however, these results must be interpreted in light of the nature of the benchmark. ITA-Bench focuses primarily on tasks involving commonsense reasoning and scientific knowledge retrieval, which are not well suited for assessing differences in text generation capabilities. Pretraining an LLM on a representative corpus does not inherently confer an advantage in reasoning or STEM-related tasks, because the dataset used for pretraining does not contain specific materials useful for increasing performance on such tasks, and no specific methods were used to promote the development of reasoning abilities. Accordingly, CORISllm and Minerva perform similarly on ITA-Bench. To properly evaluate our research hypothesis, a more targeted assessment of text generation abilities is required.</p>
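        <p>Benchmarks of this kind are typically scored by comparing the log-likelihood that the model assigns to each candidate answer. The sketch below shows this standard procedure with the HuggingFace transformers API; the checkpoint name follows the Minerva naming scheme and is used only for illustration.</p>
        <preformat># Standard log-likelihood scoring for multiple-choice benchmark items.
# The checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")
model = AutoModelForCausalLM.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")

def choice_logprob(context: str, choice: str) -> float:
    """Sum of the log-probabilities of the choice tokens given the context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # logits at position i predict token i+1, so shift the slice by one
    targets = full_ids[0, ctx_len:]
    return logprobs[0, ctx_len - 1 : -1].gather(-1, targets.unsqueeze(-1)).sum().item()

def predict(context, choices):
    """Return the index of the highest-scoring candidate answer."""
    return max(range(len(choices)), key=lambda i: choice_logprob(context, choices[i]))</preformat>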
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text Generation Quality Evaluation</title>
        <p>Evaluating LLM-generated texts is inherently challenging, and assessing the quality of these textual outputs is even more complex [28].</p>
        <p>Our primary objective is to conduct a careful evaluation of the quality of the texts generated by LLMs. Specifically, we aim to compare an LLM trained on "open" but non-representative datasets, namely the Minerva family, with one trained on a representative and balanced dataset, CORISllm. The comparison focuses on commonly used human evaluation metrics: Fluency, internal Coherence, and text Relevance to the given task.</p>
        <p>To ensure a fair evaluation, it is necessary to generate and assess a substantial number of texts. For this purpose, we adopted the LLM-as-a-Judge (LaaJ) approach, after comparing LLM annotations with human judgments. We designed six distinct prompts, each corresponding to one of the six CORIS macro-varieties: a short newspaper article, a children's fairy tale, an abstract of a scientific paper, a judgment for a crime, a trip description, and a brief movie review, and generated 50 outputs for each. Table 3 presents the prompts used to stimulate the LLMs to generate texts.</p>
        <p>The following sections first describe the human evaluation process, followed by the LaaJ methodology we employed to achieve our objective.</p>
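        <p>The generation step can be sketched as follows; the sampling parameters shown are illustrative placeholders rather than the exact settings we used.</p>
        <preformat># Sketch of the generation step: each of the six prompts is submitted 50 times
# to a model. Checkpoint name and sampling parameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")
model = AutoModelForCausalLM.from_pretrained("sapienzanlp/Minerva-350M-base-v1.0")

def generate_texts(prompt: str, n: int = 50, max_new_tokens: int = 512):
    """Sample n texts for one of the six macro-variety prompts."""
    ids = tok(prompt, return_tensors="pt").input_ids
    outputs = []
    for _ in range(n):
        with torch.no_grad():
            out = model.generate(ids, do_sample=True, top_p=0.9,
                                 temperature=0.8, max_new_tokens=max_new_tokens)
        # keep only the newly generated continuation
        outputs.append(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    return outputs</preformat>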
        <sec id="sec-3-3-1">
          <title>3.3.1. Human Evaluation of LLM Outputs</title>
          <p>Human evaluation remains the gold standard for assessing the quality of natural language outputs produced by LLMs. Despite the growing sophistication of automated metrics and model-based evaluators, human judgments are uniquely capable of capturing nuanced dimensions of quality such as contextual appropriateness, subtle coherence errors, pragmatic relevance, and factual accuracy. Consequently, human assessments are widely used both in benchmarking LLMs and in validating automatic evaluation methods.</p>
          <p>Human evaluation of LLM outputs is typically carried out using rating scales (e.g., Likert scales), pairwise comparisons, or ranking protocols. Each approach has strengths and limitations: scalar ratings allow fine-grained feedback but may suffer from rater calibration issues, while relative comparisons often yield more consistent judgments.</p>
          <p>In the context of LLM outputs, common evaluation criteria include fluency, coherence, relevance, factual accuracy, and harmlessness or bias. For instance, the HELM benchmark [29] employs extensive human annotation pipelines to assess these aspects. Fluency is often reliably judged, but tasks like evaluating factual consistency or detecting hallucinations present greater challenges. Human annotators are also crucial for detecting subtle harms, such as stereotyping or toxicity, which automated tools frequently miss or misclassify [3].</p>
          <p>Despite its value, human evaluation has notable limitations. It is expensive, time-consuming, and subject to inter-rater variability, which can obscure subtle differences between systems. Additionally, annotator background and task framing can influence outcomes. For example, work has shown that crowdworker evaluations can differ systematically from domain-expert judgments, particularly on complex tasks like summarisation or question answering [30].</p>
          <p>To compare the behaviour of the considered LaaJ systems with human judgments, we conducted a small experiment in which three expert linguists manually evaluated 120 texts produced by Minerva-350M in response to the six prompts given in Table 3. The annotators were asked to evaluate the LLM-generated texts according to the three selected metrics; the instructions given to them were almost identical to the prompts in Tables 8, 9 and 10 that we used for LaaJ. Table 4 (top-left section) shows the Spearman Rank Correlation Coefficients (SRCC) between the rankings provided by the three human annotators A1-A3, who assigned scores on a 5-point Likert scale. The correlations were relatively low, highlighting the challenges human annotators face in consistently grading text production using similar criteria.</p>
          <table-wrap id="tab-4">
            <label>Table 4</label>
            <caption>
              <p>SRCC between the Fluency (Flu.), Coherence (Coh.), and Relevance (Rel.) rankings of the human annotators A1-A3 and of the two LaaJ systems.</p>
            </caption>
          </table-wrap>
          <p>Due to the low correlations observed, particularly in the assessment of Fluency, we decided against using the human annotations to calibrate our LaaJ systems and chose to rely solely on the LaaJ methodology for the evaluations.</p>
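          <p>The agreement figures in Table 4 are plain Spearman rank correlations over paired scores; a minimal sketch, with invented example scores, is shown below.</p>
          <preformat># SRCC between two annotators' 5-point Likert scores over the same items.
# The score vectors here are invented examples, not our annotation data.
from scipy.stats import spearmanr

a1_fluency = [4, 3, 5, 2, 4, 3]  # annotator A1
a2_fluency = [3, 3, 4, 2, 5, 2]  # annotator A2

rho, p_value = spearmanr(a1_fluency, a2_fluency)
print(f"SRCC = {rho:.2f} (p = {p_value:.3f})")</preformat>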
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. LLMs as Automated Judges of Text Quality</title>
          <p>Recent advances in LLMs have opened new avenues for evaluating textual outputs in NLP. Traditionally, the evaluation of text generation has relied heavily on human judgments, which, while high in fidelity, are costly, time-consuming, and often inconsistent due to inter-annotator variability [31]. In contrast, LLMs such as GPT-3/4, PaLM and Gemini have demonstrated potential not only in generating text but also in providing reliable meta-judgments about language quality, including fluency, coherence, and relevance.</p>
          <p>Several studies have investigated the reliability of LLMs as automatic evaluators. For instance, G-Eval [32] highlights that LLMs can approximate human judgments in multi-dimensional evaluation tasks when properly prompted. As shown in the review by Li et al. [33], it is possible to set up a framework where an LLM acts as a zero-shot or few-shot judge, providing ordinal or scalar ratings that correlate highly with human annotations. This correlation is particularly strong when the models are instructed explicitly to focus on specific dimensions of quality, such as grammatical fluency or semantic relevance.</p>
          <p>In terms of Fluency, LLMs have internalised extensive grammatical structures through pretraining on large corpora, enabling them to effectively recognise and assess grammaticality and naturalness. For Coherence, models evaluate the logical consistency and flow of ideas across sentences or turns, especially when equipped with context windows that span multiple paragraphs. Evaluating Relevance, the alignment of a response to a prompt or topic, has also been shown to benefit from LLMs' contextual awareness and knowledge grounding.</p>
          <p>In summary, LLMs have emerged as credible tools for evaluating textual quality across multiple dimensions: when applied with careful prompt design and interpretative caution, they can serve as scalable, cost-effective complements to human assessment.</p>
          <p>In order to avoid any inconsistency introduced by human judgments, we decided to rely only on two different LLMs for evaluating the quality of the texts produced by the CORISllm and Minerva models. We adopted a powerful online LLM, namely Gemini-2.0-flash through the Google APIs, and an offline, quantised model, namely bartowski/Llama-3.3-70B-Instruct-Q6_K_L, downloaded from the Huggingface repository.</p>
      <sec id="sec-2-14">
        <title>2.0-flash through Google APIs, and an ofline, quantised</title>
        <p>model, namely bartowski/Llama-3.3-70B-Instruct-Q6_K_L
downloaded from the Huggingface repository3. 4. Results</p>
        <p>Tables 8, 9 and 10 show the three prompts we have
designed for asking the two LaaJ to evaluate, using a Tables 6 and 7 present the means and standard deviations
5-point Likert scale, Fluency, Coherence and Relevance of of the scores assigned by the two judges across the three
the texts generated by CORISllm-350M/1B and Minerva- evaluation metrics to the 600 texts forming the evaluation
350M/1B. For designing these prompt we took inspiration dataset. The tables also display the results of a t-test
from similar prompts proposed in G-Eval [32]. The sepa- for independent samples, which assesses the statistical
rators ‘##SYSTEM##’, ‘##USER##’ and ’##ASSISTANT##’ significance of the diferences in means.
for marking the three diferent blocks of information in Examining the evaluations provided by Gemini, we
the prompts were replaced with empty lines for Gemini observe a notable increase in the scores assigned to
prompts and with the appropriate separators for prompts CORISllm-350M compared to the equivalent
Minervaproposed to the Llama judge. 350M model. Furthermore, the scores for it are so high</p>
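          <p>A single LaaJ call can be sketched as follows. The snippet assumes the google-generativeai Python package for the Gemini judge and a simple regular expression to extract the Likert score; both are illustrative choices, not a description of our exact implementation.</p>
          <preformat># Sketch of one judge call: build the prompt from the blocks of Tables 8-10
# and parse a 1-5 score from the reply. Illustrative, not our exact pipeline.
import re
import google.generativeai as genai

genai.configure(api_key="...")  # API key elided
judge = genai.GenerativeModel("gemini-2.0-flash")

def judge_score(system_block: str, exercise: str, generated_text: str) -> int:
    # For Gemini, the ##SYSTEM##/##USER##/##ASSISTANT## separators are
    # replaced by empty lines, as described above.
    prompt = f"{system_block}\n\n{exercise}\n\n{generated_text}"
    reply = judge.generate_content(prompt).text
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return int(match.group())</preformat>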
          <p>To assess the reliability of their judgments, we first evaluated the agreement between the two LaaJ systems and the human annotators. Table 4 also reports the SRCC between each LaaJ and the human annotators. While the two LaaJ systems show a high mutual correlation, their agreement with individual human annotators is lower, though still comparable to the level of agreement observed between the human annotators themselves. This further supports the case for favouring LaaJ-generated annotations over those produced by humans.</p>
          <p>Table 5 shows the SRCCs between the Gemini-2.0-flash and the quantised Llama-3.3-70B judges when evaluating 600 new texts produced by the CORISllm and Minerva models (300 for each model). The correlations are all quite high and highly significant; we can therefore reliably use these automatic judges for evaluating the textual production of the tested models.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Tables 6 and 7 present the means and standard deviations of the scores assigned by the two judges, across the three evaluation metrics, to the 600 texts forming the evaluation dataset. The tables also display the results of a t-test for independent samples, which assesses the statistical significance of the differences in means.</p>
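      <p>The significance test is a standard two-sample t-test over the 300 per-model scores for each metric. The sketch below uses synthetic score vectors: the CORISllm mean and standard deviation are taken from Table 6, while the Minerva values are invented stand-ins.</p>
      <preformat># Two-sample t-test on per-model judge scores for one metric.
# Synthetic data: the Minerva parameters are invented stand-ins.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
corisllm_fluency = rng.normal(2.74, 0.96, 300)  # mean/std from Table 6
minerva_fluency = rng.normal(2.50, 0.95, 300)   # illustrative values

t_stat, p_value = ttest_ind(corisllm_fluency, minerva_fluency)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant entries get an asterisk</preformat>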
      <p>Examining the evaluations provided by Gemini, we observe a notable increase in the scores assigned to CORISllm-350M compared to the equivalent Minerva-350M model. Furthermore, its scores are so high that they are comparable, and not significantly different, to those of the larger Minerva-1B model, with the exception of the Relevance metric. With regard to CORISllm-1B, it performs much better than Minerva-350M, as expected, and more or less on par with Minerva-1B, exhibiting better performance on Fluency and worse on Relevance. All differences that are statistically significant are indicated by the asterisks next to the metric.</p>
      <p>Regarding the Llama-3.3-70B judge, CORISllm-350M consistently receives significantly higher scores than the equivalent Minerva-350M model, and its scores are comparable to those of the Minerva-1B model across all metrics. Using this judge, CORISllm-1B performs much better than both Minerva models in a highly significant way.</p>
      <table-wrap id="tab-6-7">
        <label>Tables 6 and 7</label>
        <caption>
          <p>Means and standard deviations of the judges' scores for Fluency (Flu), Coherence (Coh), and Relevance (Rel), with statistically significant differences marked by asterisks (partial; recoverable entries: CORISllm-350M Flu=2.74±0.96, Coh=2.01±0.80, Rel=1.97±1.39; CORISllm-1B Flu=3.1±0.98, Coh=2.18±0.92, Rel=1.95±1.35).</p>
        </caption>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>5. Discussion &amp; Conclusions</title>
      <sec id="sec-3-1">
        <title>When evaluating the textual production of equiva</title>
        <p>In this study, we examined how the choice of data for lent models across Fluency, internal Coherence, and
RelLLM pretraining afects performance, emphasizing the evance to the assigned task, CORISllm outperformed
importance of using a representative corpus to enhance Minerva. Due to the limited dimensions of the training
the quality of text produced by generative LLMs. corpus, suitable to pretrain 350M models and less 1B
mod</p>
        <p>Using the design framework of the CORIS corpus, a els, the results are more neat on smaller models. In any
representative corpus of contemporary Italian, we pre- case, this points in the direction that using representative
trained two LLMs following exactly the same process and balanced corpora for LLM pretraining has an impact
used for the Minerva models [12]. However, instead of on performance. In our experiments, CORISllm-350M,
the original dataset, we used a new 11.6 billion-token rep- despite having only one-third of the model parameters,
resentative corpus specifically structured to align with performed nearly on par with Minerva-1B in terms of
the CORIS macrovarieties. generative text quality.</p>
        <p>##SYSTEM##
Tu sei un linguista esperto nella valutazione dei testi. Ti
verrà fornita la descrizione di un esercizio e lo svolgimento
di questo esercizio da parte di un’AI.</p>
        <p>Il tuo compito è valutare lo svolgimento in base a una metrica.</p>
        <p>Assicurati di leggere e comprendere attentamente queste
istruzioni. Tieni aperto questo documento durante la
revisione e consultalo quando necessario.</p>
        <p>Criteri di valutazione:
Coerenza (1-5): la qualità globale di tutte le frasi. Il testo
dovrebbe essere ben strutturato e ben organizzato. Il testo
non dovrebbe contenere solo un mucchio di informazioni
correlate, ma dovrebbe svilupparsi da una frase a un corpo
coerente di informazioni su un argomento.</p>
        <p>Fasi di valutazione:
1. Leggi attentamente lo svolgimento e identifica
l’argomento principale e i punti chiave. 2. Analizza il
contenuto di ogni frase e valuta se frasi successive sono legate
logicamente e strutturalmente. 3. Assegna un punteggio
per la coerenza su una scala da 1 a 5, dove 1 è il punteggio
più basso e 5 il punteggio più alto in base ai Criteri di
valutazione.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The goal of this work was not to create a complete</title>
        <p>family of LLMs pretrained on representative corpora and
ready for production deployment. Rather, we aimed to
provide a proof-of-concept study that emphasises the
need for greater attention to training corpora in order to
develop better models.</p>
        <p>While the openness of training data is certainly a
valuable principle, the results presented here suggest that it is
equally important to incorporate high-quality published
texts into the training process in order to enhance
performance without altering the transformer model. Since
such materials are often protected by copyright, it is
essential to establish specific agreements with publishers.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Due to copyright restrictions on portions of our pre</title>
        <p>training corpus, we are unable to distribute it freely.</p>
      </sec>
      <sec id="sec-3-4">
        <title>CORISllm models are available upon request.</title>
        <p>##SYSTEM##
Tu sei un linguista esperto nella valutazione dei testi. Ti
verrà fornita la descrizione di un esercizio e lo svolgimento
di questo esercizio da parte di un’AI.</p>
        <p>Il tuo compito è valutare lo svolgimento in base a una metrica.
Assicurati di leggere e comprendere attentamente queste
istruzioni. Tieni aperto questo documento durante la
revisione e consultalo quando necessario.</p>
        <p>Criteri di valutazione:
Rilevanza (1-5): Lo svolgimento deve includere solo
informazioni allineate con la descrizione dell’esercizio. Dovrai
penalizzare gli svolgimenti che contengono informazioni o
argomenti non rilevanti rispetto alla descrizione.
Fasi di valutazione:
1. Leggi attentamente lo svolgimento e identifica
l’argomento principale e i punti chiave. 2. Confronta lo
svolgimento con la descrizione dell’esercizio. 3. Assegna un
punteggio di rilevanza da 1 a 5, dove 1 è il punteggio più
basso e 5 il punteggio più alto in base ai Criteri di valutazione.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We acknowledge the CINECA award no. HP10CPJEG0 (project DARE4LLM) under the ISCRA initiative, for the availability of HPC resources and support.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '
          <volume>20</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Roller,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dewan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mipp</surname>
          </string-name>
          .
          <volume>189</volume>
          -
          <fpage>197</fpage>
          . ceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>599</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R. Rossini</given-names>
            <surname>Favretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          , C. De Santis, [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , CORIS/CODIS:
          <article-title>A corpus of written Italian based H</article-title>
          .
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Y. Zhang,</given-names>
          </string-name>
          <article-title>on a defined and a dynamic model</article-title>
          , in: A. Wilson,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , A survey on P. Rayson,
          <string-name>
            <surname>T.</surname>
          </string-name>
          McEnery (Eds.),
          <article-title>A Rainbow of Cor- evaluation of large language models</article-title>
          ,
          <source>ACM Trans. pora: Corpus Linguistics and the Languages of the Intell. Syst. Technol</source>
          .
          <volume>15</volume>
          (
          <year>2024</year>
          ). World, Lincom-Europa, Munich,
          <year>2002</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>38</lpage>
          . [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sinclair</surname>
          </string-name>
          , Corpus, Concordance, Collocation, Ox- et al.,
          <article-title>Holistic evaluation of language models</article-title>
          , Transford University Press,
          <year>1991</year>
          .
          <source>actions on Machine Learning Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          , [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karpinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Akoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>The perils of M. Guerini, Geppetto carves italian into a language using Mechanical Turk to evaluate open-ended text model</article-title>
          ,
          <source>in: Proceedings of the 7th Italian Conference generation, in: Proceedings of the 2021 Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2020</year>
          ),
          <source>CEUR on Empirical Methods in Natural Language ProWorkshop Proceedings</source>
          , Bologna, Italy,
          <year>2020</year>
          . cessing, Association for Computational Linguistics,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Buchatskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de Las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hennigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>van den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Damoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <article-title>Training compute-optimal large language models</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22</source>
          , Curran Associates Inc., Red Hook, NY, USA,
          <year>2022</year>
          . [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>van der Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>van Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krahmer</surname>
          </string-name>
          ,
          <article-title>Human evaluation of automatically generated text: Current trends and best practice guidelines</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>67</volume>
          (
          <year>2021</year>
          ) 101151. [32]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>G-Eval: NLG evaluation using GPT-4 with better human alignment</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>2511</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Paperno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kruszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boleda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>The LAMBADA dataset: Word prediction requiring a broad discourse context</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Erk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Berlin, Germany,
          <year>2016</year>
          , pp.
          <fpage>1525</fpage>
          -
          <lpage>1534</lpage>
          . [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods</article-title>
          ,
          <year>2024</year>
          . arXiv:2412.05579.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Le Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>WinoGrande: An adversarial Winograd schema challenge at scale</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8732</fpage>
          -
          <lpage>8740</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>HellaSwag: Can a machine really finish your sentence?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>4791</fpage>
          -
          <lpage>4800</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Measuring massive multitask language understanding</article-title>
          ,
          <source>in: Proceedings of the International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Itabench: Towards a more comprehensive evaluation for Italian LLMs</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <source>CEUR Workshop Proceedings</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>