<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BAMBI Goes to School: Evaluating Italian BabyLMs with Invalsi-ITA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Capone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Suozzi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca E. Lebani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa</institution>
          ,
          <addr-line>Via Santa Maria 36, 56126 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>European Centre for Living Technology (ECLT)</institution>
          ,
          <addr-line>Ca' Bottacin, Dorsoduro 3911, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>QuaCLing Lab, Dipartimento di Studi Linguistici e Culturali Comparati, Università Ca' Foscari Venezia</institution>
          ,
          <addr-line>Dorsoduro 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper explores the impact of ecologically and cognitively plausible data on the training of language models. It builds on prior work [1, 2] integrating child-directed speech, curriculum learning and instruction tuning to train Italian BabyLMs. To evaluate our BabyLMs, we compare the performance of these models, trained on fewer than 100M words using various techniques, with that of native Italian Large Language Models using the Invalsi-ITA [3] benchmark, designed to evaluate Italian students on text comprehension and linguistic abilities. The goal is to assess whether cognitively motivated training approaches (curriculum learning based on child-directed speech and child-friendly data), which are crucial for a meaningful comparison between human learners and computational systems [4], yield greater efficiency than standard methods.</p>
      </abstract>
      <kwd-group>
<kwd>Italian BabyLM</kwd>
        <kwd>Invalsi-ITA benchmark</kwd>
        <kwd>LM Evaluation</kwd>
        <kwd>Text Comprehension</kwd>
        <kwd>Italian Grammar</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Even though Language Models (LMs) have taken research in linguistics and cognitive science by storm, their meaningful application in these fields still faces significant challenges. For LMs to be useful and informative for understanding language and cognition, several plausibility criteria must be met [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. Among the most important are the amount of input received during training and the number of trainable parameters. A growing body of empirical evidence shows that beyond a certain model size and amount of training data, the probability distributions generated by LMs diverge from human-like patterns and become poor predictors of psycholinguistic measures, such as eye-tracking data [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In contrast, smaller models trained on a limited amount of data appear to align more closely with human reading strategies. This observation is consistent with findings from the BabyLM Challenge, which demonstrate that models trained on child-directed speech and capped at 100 million words can achieve strong syntactic competence [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. In addition to model size and training data volume, other plausibility criteria should be considered. These include the quality of the input (such as child-directed speech) and the manner in which it is presented, for instance through Curriculum Learning (CL). Moreover, the standard language modeling objective differs substantially from the discursive and interactive exchanges children engage in with adults and peers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In short, approximating child language learning conditions requires attention to multiple dimensions.
      </p>
      <p>This study investigates the impact of such dimensions on LMs' development of linguistic skills. Specifically, we examine the effectiveness of training Italian BabyLMs using child-directed speech, curriculum learning, and instruction tuning, techniques inspired by human language acquisition, with the aim of assessing whether these cognitively grounded methods lead to improved performance compared to conventional training approaches, particularly when working with limited data. To this end, we evaluate our BabyLMs against native Italian Large Language Models using the Invalsi-ITA benchmark, which focuses on text comprehension and linguistic knowledge.</p>
      <p>The paper is structured as follows: first, an overview of related works is provided in Section 2. Section 3 is dedicated to the description of the models' evaluation. The models are presented in Section 3.1, whilst Sections 3.2 and 3.3 describe the Invalsi-ITA benchmark, used for the evaluation, and the procedure followed to assess the models' abilities. The results of the evaluation are detailed in Section 3.4 and discussed in Section 3.5. Finally, some conclusions are drawn in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works 3. Evaluating Text Comprehension and Grammatical Knowledge with Invalsi-ITA</title>
      <p>Two lines of research are particularly relevant to our
goals, as they represent two sides of the same coin: the
ifrst focuses on the quality and quantity of training data
necessary for BabyLMs to develop linguistic abilities; the 3.1. Models
second concerns the evaluation of BabyLMs through the
creation or adaptation of benchmarks originally designed The Bambi model is based on a lightweight GPT-2-style
to assess the linguistic competence of human speakers. decoder architecture, with approximately 136 million
pa</p>
      <p>
        Regarding the first aspect, several studies have ex- rameters (Table 1). It is trained on a dataset composed of
plored training models on datasets that are compara- transcripts of child-directed speech and multimedia
ble—both in size and in linguistic nature—to the input typ- content designed for children [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. So far, the dataset
ically received by children during early development (e.g., is organized into three tiers of increasing linguistic
com[
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]). These works show that while a large vol- plexity, corresponding to the age ranges 0–6, 6–12, and
ume of data is essential for achieving strong performance 12–18. An additional tier is currently in progress. For the
on standard Natural Language Understanding tasks, a Bambi baseline model, all three tiers are used in a fully
significantly smaller amount is suficient for acquiring shufled format. In contrast, the Bambi_CL (Curriculum
core syntactic knowledge. In addition to data quantity Learning) model is trained on the tiers sequentially,
proand quality, the importance of curriculum learning strate- gressing from the simplest to the most complex. Based
gies and model architecture optimization has also been on both the base and CL models, Instruction Tuning
highlighted [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. (IT) variants are implemented (Table 2). The IT training
      </p>
      <p>
        On the evaluation front, several benchmarks have been dataset comprises the following resources:
developed over the years (e.g., [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]). While these • teelinsan/camoscio_cleaned : a translated
benchmarks are efective tools for comparing models version [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] of the Stanford Alpaca dataset
against each other, they are not well-suited for comparing [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which consists of LM-generated
instructionmodels to human language abilities, especially those of response pairs based on a seed set of
humanchildren. Although some studies have directly addressed written prompts [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The dataset contains
apthis gap (e.g., [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), they have not yet produced large- proximately 50,000 items.
scale, standardized benchmarks for this purpose. • massimilianowosz/gsm8k-it : a translated
      </p>
      <p>
        For the Italian language, to the best of our knowledge, version of GSM8K [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], a dataset of 8.500 grade
only two benchmarks currently enable both model-to- school-level math word problems.
model and model-to-human comparisons. The first is
BaBIEs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a benchmark derived from the adaptation of •
dMaattatseitmoafx/ItDaAliTanA--lAanIg_uCaognevceornsvaetrisaotnio_nIsT,Aco:mafour standardized tests originally designed to assess the prising 10,000 items [24].
semantic and syntactic competence of Italian-speaking
children. The second is Invalsi-ITA [
        <xref ref-type="bibr" rid="ref19 ref3">3, 19</xref>
        ], described in
Section 3.2, which aims to evaluate text comprehension
and linguistic abilities in Italian students from primary
through high school.
      </p>
      <p>In this study, we employ the Invalsi-ITA benchmark
to evaluate various Bambi models, a series of Italian
BabyLMs which difer from one another in terms of i.)
the amount of training data, ii.) the type of training
data and learning strategies adopted, and iii.) instruction
tuning (cf. Section 3.1). This benchmark is particularly
well-suited to our analysis, as it allows us to observe
improvements or declines across school grades and to
isolate which of the above three variables may be
influencing such trends in performance.</p>
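        <p>To make the two training regimes concrete, the sketch below contrasts the fully shuffled baseline stream with the tier-by-tier curriculum stream. It is a minimal Python illustration with toy data, not the released training code; function names and example sentences are ours.</p>
        <preformat>
import random

# Toy tier corpora for the three age ranges; in the actual setting
# each list would contain tokenized documents.
tiers = {
    "0-6":   ["mamma guarda il gatto", "il cane corre"],
    "6-12":  ["la maestra spiega la lezione di storia"],
    "12-18": ["il saggio discute le cause della rivoluzione"],
}

def baseline_stream(tiers, seed=0):
    """Bambi baseline: all tiers merged and fully shuffled."""
    docs = [doc for tier in tiers.values() for doc in tier]
    random.Random(seed).shuffle(docs)
    return docs

def curriculum_stream(tiers):
    """Bambi_CL: tiers presented sequentially, simplest first."""
    return [doc for name in ("0-6", "6-12", "12-18") for doc in tiers[name]]

print(baseline_stream(tiers))
print(curriculum_stream(tiers))
        </preformat>
        <p>In both regimes the model sees the same documents; only their order differs, which is precisely the variable the CL comparison isolates.</p>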
      <sec id="sec-2-1">
        <title>For comparison purposes, the same architecture was</title>
        <p>trained on a traditional dataset of equivalent size, using
a random subset of mC4 [25], a corpus derived from
the public Common Crawl web scrape and used to train
standard LMs.</p>
        <p>It is important to note that BabyLMs typically operate
with limited input and output context windows, both
to maintain model compactness and to respect cognitive
plausibility constraints. In particular, the training data for
the first and second developmental tiers avoid excessively
long sequences. However, to enable evaluation on the
Invalsi-ITA benchmark, the models were trained with a
context window of 6,144 tokens, the minimum required to
avoid truncating benchmark items. Crucially, our dataset
remains untouched. The BabyLMs are compared against
ifve other models (Tables 1 and 2). Minerva-3B is the
model trained on the least amount of data, despite not
being the smallest in size. It is followed by Minerva-7B and
Minerva 7B-it, which rank second in terms of data
volume [26]. Next is Velvet-2B, trained on approximately 3</p>
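        <p>The 6,144-token figure can be obtained by scanning the benchmark for its longest item. A minimal sketch under that assumption, with items available as plain strings; the whitespace tokenizer is only a stand-in for the models' subword tokenizers, whose counts run higher:</p>
        <preformat>
def required_context(items, encode):
    """Smallest context window that avoids truncating any item;
    `encode` maps a string to a list of tokens."""
    return max(len(encode(item)) for item in items)

# Toy usage with a whitespace "tokenizer".
items = ["Read the text and answer the question: ...",
         "Indica in quale frase la parola 'pietra' è usata in senso figurato."]
print(required_context(items, str.split))
        </preformat>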
        <sec id="sec-2-1-1">
          <title>Architecture</title>
          <p>30,000
32,768
51,200
126,976
32,000
12x12
32x32
32x32
28x32
32x32</p>
          <p>768
2,560
4,096
2,048
4,096</p>
          <p>
            135,856,128
2,894,236,160
7,399,018,496
2,223,097,856
7,241,732,096
trillion tokens 1, and finally Cerbero-7B, for which the Invalsi-ITA focuses on the Italian language. It
origamount of training data has not been disclosed by the inally included 1,264 questions, classified by [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] into:
developers [27]. These models were chosen because their i.) multiple choice; ii.) binary (e.g., TRUE/FALSE); iii.)
training corpora are predominantly in Italian. open-ended; iv.) other. The authors of the benchmark
excluded categories (iii.) and (iv.) retaining only multiple
3.2. Invalsi-ITA choice (87.47%) and binary (14.33%) questions, for a total
of 1,117 questions. The benchmark assesses two main
Invalsi-ITA [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] is a benchmark derived from the adap- kinds of competence: text comprehension and
linguistation of an established battery of assessments aimed at tic knowledge. Text comprehension items (930/1,117,
gauging educational proficiency throughout Italy. 83.26% of the total) require students to read a text and
an
          </p>
          <p>The INVALSI (Istituto nazionale per la valutazione del swer related questions (e.g., Le prime tre righe del racconto
sistema educativo di istruzione e di formazione ‘National parlano della vita di Polipetto nel suo ambiente. Quale
Institute for the Evaluation of the Education and Training frase spiega in poche parole come viveva Polipetto? ‘The
System’) tests have been administered to Italian students ifrst three lines of the story talk about Polipetto’s life
since the 2005/2006 school year. These tests are designed in his environment. Which sentence briefly explains
to monitor the students’ competence of Italian language how Polipetto lived?’), while language items (187/1,117,
and Mathematics throughout their educational path. In- 16.74% of the total) assess knowledge of specific
gramcreasingly complex tests are administered during primary matical rules (e.g., Indica in quale frase la parola “pietra”
school (grades 2 and 5), middle school (grades 6 and 8) è usata in senso figurato, cioè non indica la pietra vera e
and high school (grades 10 and 13). propria. ‘Indicate in which sentence the word “stone” is
used figuratively, that is, it does not refer to an actual
1https://huggingface.co/Almawave/Velvet-2B stone.’).</p>
        </sec>
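        <p>The item shares reported above follow directly from the counts and can be verified in a few lines:</p>
        <preformat>
total, comprehension, language = 1117, 930, 187
assert comprehension + language == total
print(f"comprehension: {comprehension / total:.2%}")  # 83.26%
print(f"language: {language / total:.2%}")            # 16.74%
        </preformat>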
        <sec id="sec-2-1-2">
          <title>Question Macro-Area Grade 2 Grade 5 Grade 6</title>
          <p>Comprehension
Semantics
Syntax
Morphology
Phonology
Pragmatics/Textuality
Punctuation/Spelling
Total
149
1
0
0
0
0
1
151</p>
          <p>Figure 1 shows the accuracy obtained by all models in
each grade, considering both the text comprehension and
the linguistic items. The accuracy values for each model
in each grade are reported in Table 4 (Appendix 4).</p>
          <p>A similar accuracy pattern emerges across grades 2
3.3. Method to 10 (Figure 1,). Cerbero-7B consistently achieves the
The items are presented to the models in a zero-shot highest accuracy, although its performance gradually
setting. Each item consists of a text (when present), a declines over the grades. Minerva-7B and
Minerva-7Bquestion that includes the list of multiple-choice options, it follow with slightly lower scores, showing peaks in
and the answer, often represented only by the letter cor- grades 2 and 6, a pattern also observed in Velvet-2B. In
responding to the correct choice. Prompts and expected contrast, Minerva-3B aligns more closely with the Bambi
outputs are formatted using the following template (orig- models, which display the lowest accuracy throughout
inally in Italian; a translation is provided here for clarity). these grades.</p>
          <p>A diferent pattern emerges in grade 13: Bambi,
Prompt: Bambi_it, and Bambi_mc4_it achieve the highest
accuracy, alongside Velvet-2B. Slightly lower scores are
Read the text and answer the question: obtained by the Minerva models, with Minerva-7B-it
{text} still leading this group. Notably, Cerbero-7B’s
perfor{question} mance drops significantly in this final grade. Focusing
Completions: on the Bambi family, the strongest performances are
overall exhibited by Bambi, Bambi_it, Bambi_CL_it, and
• La risposta corretta è A: {answer_a} Bambi_mc4_it.
• La risposta corretta è B: {answer_b} Let us now turn to the accuracy the models achieved in
the text comprehension items, displayed in Figure 2. The
accuracy values are reported in Table 5 (Appendix A). The</p>
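        <p>The paper does not spell out how a model's choice is extracted from this template; a common zero-shot protocol, sketched below under that assumption, ranks the candidate completions by the likelihood the model assigns to them given the prompt. Function names and the toy scorer are ours.</p>
        <preformat>
def format_item(text, question):
    """Zero-shot prompt following the template above
    (English gloss; the original is in Italian)."""
    return f"Read the text and answer the question:\n{text}\n{question}"

def candidate_completions(options):
    """One candidate per option letter, mirroring
    'La risposta corretta è A: {answer_a}', etc."""
    return {letter: f"La risposta corretta è {letter}: {answer}"
            for letter, answer in options.items()}

def pick_answer(prompt, completions, loglik):
    """Return the letter whose completion the model scores highest;
    `loglik` is any callable giving log P(completion | prompt)."""
    return max(completions, key=lambda k: loglik(prompt, completions[k]))

# Toy scorer standing in for a real LM (prefers shorter completions),
# used only to make the example runnable end to end.
toy_loglik = lambda prompt, completion: -len(completion)
options = {"A": "viveva tranquillo nel suo scoglio",
           "B": "era costretto a fuggire continuamente dai predatori"}
prompt = format_item("...", "Quale frase spiega come viveva Polipetto?")
print(pick_answer(prompt, candidate_completions(options), toy_loglik))  # "A"
        </preformat>
        <p>Accuracy is then the share of items whose selected letter matches the gold letter.</p>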
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>Figure 1 shows the accuracy obtained by all models in each grade, considering both the text comprehension and the linguistic items. The accuracy values for each model in each grade are reported in Table 4 (Appendix A).</p>
        <p>A similar accuracy pattern emerges across grades 2 to 10 (Figure 1). Cerbero-7B consistently achieves the highest accuracy, although its performance gradually declines over the grades. Minerva-7B and Minerva-7B-it follow with slightly lower scores, showing peaks in grades 2 and 6, a pattern also observed in Velvet-2B. In contrast, Minerva-3B aligns more closely with the Bambi models, which display the lowest accuracy throughout these grades.</p>
        <p>A different pattern emerges in grade 13: Bambi, Bambi_it, and Bambi_mc4_it achieve the highest accuracy, alongside Velvet-2B. Slightly lower scores are obtained by the Minerva models, with Minerva-7B-it still leading this group. Notably, Cerbero-7B's performance drops significantly in this final grade. Focusing on the Bambi family, the strongest performances are overall exhibited by Bambi, Bambi_it, Bambi_CL_it, and Bambi_mc4_it.</p>
        <p>Let us now turn to the accuracy the models achieved on the text comprehension items, displayed in Figure 2. The accuracy values are reported in Table 5 (Appendix A). The figure shows that the accuracy values and patterns observed for the comprehension items largely reflect those found in the overall analysis. Cerbero-7B consistently achieves the highest accuracy across grades 2 to 10 (with all values above 0.50, though gradually declining), while a marked drop is observed in grade 13. Across grades 2 to 10, the Minerva models attain the second-highest accuracy, with Minerva-7B-it performing best within the family, closely followed by Minerva-7B. As in the overall analysis, the Bambi models perform poorly from grades 2 to 10 but improve significantly in grade 13: Bambi, Bambi_it, and Bambi_mc4_it all exceed 0.50 accuracy in this grade. The same pattern is observed for Velvet-2B.</p>
        <p>A different trend is observed when considering only the accuracy achieved on the language items, displayed in Figure 3. The accuracy values are reported in Table 6 (Appendix A). Cerbero-7B, Velvet-2B, and Minerva-3B perform overall worse on items specifically targeting grammatical knowledge than they do on text comprehension items. Minerva-7B and Minerva-7B-it, on the contrary, achieve similar accuracies in both tasks, and perform better in this task in grades 2 and 6. As for the Bambi models, they differ from each other in the accuracy they achieve. In grade 2, only Bambi, Bambi_mc4, and Bambi_mc4_it achieve their highest accuracy of all grades (0.50), whereas the others do not provide any correct answer in this grade. In grade 5 the same three Bambi models perform slightly better than Minerva-3B and Velvet-2B. In grade 6 Bambi_CL and Bambi_CL_it reach a peak in accuracy exceeding 0.50, followed by Bambi_mc4_it. Overall, grades 2 and 6 appear to be easier for some models, but challenging for others. Grade 13 is challenging for all models, as none of them provide a correct response.</p>
        <p>Finally, let us consider the accuracy achieved by the models on the two kinds of questions that compose the Invalsi-ITA benchmark, i.e., multiple choice and binary (a summary of the accuracy values for binary and multiple choice questions is reported in Table 7 in Appendix A). The accuracies for the binary questions are displayed in Figure 4. For binary questions, accuracy generally hovers around or slightly above the expected chance level (0.5). Most models tend to perform better at the lower (grade 2) and upper (grade 13) ends of the evaluation spectrum, with a noticeable dip in performance across intermediate grades (5-10). Among the best-performing models, Bambi_CL_it and Cerbero-7B achieve the highest accuracy at grade 2 (0.70 and 0.65, respectively). Minerva-7B-it and Cerbero-7B show relatively stable performance across grade levels, with only minor fluctuations. Notably, Bambi_CL_it performs comparably to larger models.</p>
        <p>Multiple choice questions (Figure 5) appear to be more challenging for all models. Given the four-alternative format, chance accuracy is approximately 0.25, and most models perform only marginally above this baseline. Still, some models demonstrate steady improvement across grade levels, particularly Velvet-2B and Cerbero-7B. The latter stands out as the most consistent and accurate performer in this task, achieving scores in the range 0.53 to 0.56 across several grades and peaking at 0.625 in grade 13. Bambi models, on the contrary, seem to find this kind of question more challenging, particularly in grades 2 to 10. However, Bambi, Bambi_CL_it, and Bambi_mc4 exceed chance level in various grades. In particular, the performance of Bambi, Bambi_it, and Bambi_mc4 peaks at grade 13, reaching an accuracy of around 0.40.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Discussion</title>
        <p>The Invalsi-ITA benchmark appears to be challenging for all the models under investigation, as none of them exceeds an accuracy value of 0.60. It should be kept in mind, however, that Invalsi tests are also challenging for Italian students [<xref ref-type="bibr" rid="ref3">3</xref>] (the benchmark unfortunately does not provide student-level data, but the paper describing the original resource [<xref ref-type="bibr" rid="ref3">3</xref>] includes a bar plot illustrating the performance gap, which highlights the challenges faced by Italian students).</p>
        <p>The larger models, i.e., Cerbero-7B, Minerva-7B and Minerva-7B-it, perform overall better in this benchmark, especially when they are instruction-tuned. The reason may lie in the nature of Invalsi-ITA. This benchmark consists of text comprehension items and language items, which specifically address normative grammatical rules, rather than the models' linguistic competence tout court. Naturally, models which are exposed to a larger amount of training data and, even more importantly, to a large amount of written data, may be facilitated in these kinds of tasks, either because they have been exposed to the actual texts used in the benchmark, or because they are more used to this kind of linguistic input.</p>
        <p>Nonetheless, the Bambi models exhibit a great improvement in grade 13 on the text comprehension items, and some of them perform comparably to larger models on the language items (e.g., in grades 2 and 6). These results suggest that compact models, despite lacking comprehensive world knowledge, can develop robust grammatical knowledge at early stages of training. Furthermore, considering binary questions, most of them, particularly Bambi_CL_it, Bambi_mc4 and Bambi_mc4_it, perform comparably to larger models in specific grades despite their compact size and training constraints, suggesting the potential benefits of a combination of oral and written training data.</p>
        <p>Turning to curriculum learning and instruction tuning, a closer examination of the different Bambi models indicates that each strategy contributes modest gains, particularly in early grades. However, models that combine both strategies, such as Bambi_CL_it, show more consistent improvements, especially compared to IT-only variants. This is particularly evident in the case of the language items. The pattern implies that CL may enhance a model's capacity for subsequent learning, making IT more effective. This finding aligns with insights from human developmental learning, where structured progression lays the groundwork for improved adaptability and generalization over time (we acknowledge the importance of cross-linguistic validation; to this end, we have submitted a related study to the third BabyLM Challenge [<xref ref-type="bibr" rid="ref28">28</xref>], currently under review, and preliminary results on English show a similar trend).</p>
        <p>These results give rise to some puzzling observations that merit closer examination. For instance, when comparing the Bambi models with their mc4-trained counterparts, substantial differences appear only in grades 2 (although this grade includes only two items) and 6 of the language items. This prompts the question of whether using ecologically plausible data is as crucial as often assumed, or whether standard training corpora, such as mc4, can produce comparable results. In fact, the Bambi_mc4 models perform comparably to other Bambi models in many settings, indicating that the choice of data alone does not yield a substantial difference. However, they do not clearly outperform the Bambi models either: they achieve their best relative result in grade 5 of the language items, but in all other grades and tasks they perform worse than, or at best match, at least one of the Bambi variants. This pattern suggests that while web training data can approximate the results of carefully curated child-directed speech to some extent, it does not consistently provide an advantage, highlighting the need for a deeper analysis of the interactions between data quality, structure, and curriculum learning.</p>
        <p>Another notable result is the unexpected jump in performance of the Bambi_CL models in grade 6 on the language items. One possible explanation lies in the CL strategy: although the total number of tokens processed by these models over multiple epochs approaches the lifetime exposure of an 18-year-old adolescent, the absolute size of the Bambi dataset more closely reflects the typical linguistic input of a child aged six to eight. This alignment may account for the relatively strong results in grade 6, which corresponds to the final portion of the training curriculum. However, this interpretation does not readily explain another surprising outcome: in the text comprehension task for grade 13, the Bambi and Bambi_mc4 models outperform not only Bambi_CL and Bambi_CL_it, but also larger models like Minerva and Cerbero-7B. This could be an artifact of the limited number of items in this grade, but it highlights an area where further investigation is warranted to understand how data composition, curriculum pacing, and task type interact in shaping model behavior.</p>
        <p>Taken together, these findings highlight several key
insights. First, larger model size alone does not guarantee
superior performance: smaller models can be competitive
in specific cases, particularly in structurally simpler tasks.</p>
        <p>Second, training strategies such as CL and IT apparently yield effective improvements only under specific evaluation conditions. Finally, the performance gap between BabyLMs and LLMs remains substantial, particularly in tasks requiring deep semantic understanding or world knowledge. Closing this gap without compromising cognitive and linguistic plausibility remains a key challenge.</p>
        <p>Future work will need to explore new training strategies and evaluation frameworks to address it.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this work, we presented an evaluation of six Bambi model variants alongside five larger models, using the Invalsi-ITA benchmark, which assesses text comprehension and linguistic abilities.</title>
      <p>This evaluation revealed that larger models have an advantage in the text comprehension task, either because they have already encountered the texts used in the benchmark or because they are more accustomed to this kind of linguistic input. Nonetheless, smaller but more cognitively plausible models appear to be advantaged in learning and generalization, as highlighted by their improvement in higher grades on both text comprehension and language items.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>We acknowledge financial support under the PRIN</title>
        <p>2022 Project Title "Computational and linguistic
benchmarks for the study of verb argument structure" – CUP
I53D23004050006 - Grant Assignment Decree No. 1016,
07/07/2023 by the Italian Ministry of University and
Research (MUR), funded by the European Commission
under the NextGeneration EU programme. This research
was also partly funded by PNRR—M4C2—Investimento
1.3, Partenariato Esteso PE00000013—“FAIR—Future
Artificial Intelligence Research”—Spoke 1 “Human-centered
AI,” funded by the European Commission under the
NextGeneration EU programme.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix A: Accuracy Values for Invalsi-ITA</title>
      <p>Declaration on Generative AI</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Capone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Suozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          , et al.,
          <article-title>BaBIEs: A Benchmark for the Linguistic Evaluation of Italian Baby Language Models</article-title>
          ,
          <source>in: Proceedings of the Tenth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Suozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Capone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          , BAMBI:
          <article-title>Developing BAby language Models for Italian</article-title>
          ,
          <source>Lingue e linguaggio, Rivista semestrale</source>
          (
          <year>2025</year>
          )
          <fpage>83</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cassese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <article-title>The invalsi benchmarks: measuring the linguistic and mathematical understanding of large language models in italian</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Computational Linguistics</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>6782</fpage>
          -
          <lpage>6797</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          , T. Linzen,
          <article-title>Bigger is not always better: The importance of human-scale language modeling for psycholinguistics</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>144</volume>
          (
          <year>2025</year>
          )
          <fpage>104650</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What artificial neural networks can tell us about human language acquisition, in: Algebraic structures in natural language</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>Understanding natural language understanding systems</article-title>
          ,
          <source>Sistemi intelligenti 35</source>
          (
          <year>2023</year>
          )
          <fpage>277</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Connell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lynott</surname>
          </string-name>
          ,
          <article-title>What can language models tell us about human cognition?</article-title>
          ,
          <source>Current Directions in Psychological Science</source>
          <volume>33</volume>
          (
          <year>2024</year>
          )
          <fpage>181</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.-D.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Schuler</surname>
          </string-name>
          ,
          <article-title>Why does surprisal from larger transformer-based language models provide a poorer fit to human reading times?</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>336</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>A. De Varda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Marelli</surname>
          </string-name>
          ,
          <article-title>Scaling in cognitive modelling: A multilingual approach to human reading times, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ciro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mosquera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Paranjabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , et al.,
          <article-title>Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora</article-title>
          ,
          <source>in: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <article-title>Findings of the second BabyLM challenge: Sample-efficient pretraining on developmentally plausible corpora</article-title>
          ,
          <source>in: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>When Do You Need Billions of Words of Pretraining Data?</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Nat- H</source>
          . Jun,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          , J. Hilton,
          <source>ural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          , Training veri2021, pp.
          <fpage>1112</fpage>
          -
          <lpage>1125</lpage>
          . ifers to solve math word problems, arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Huebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cynthia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          , Baby- arXiv:
          <fpage>2110</fpage>
          .14168 (
          <year>2021</year>
          ).
          <article-title>BERTa: Learning more grammar with small-scale [24] Mattimax, Italian conversations dataset by m.inc, child-directed language</article-title>
          ,
          <source>in: Proceedings of the 2025</source>
          . URL: https://huggingface.co/datasets/ 25th conference
          <article-title>on computational natural language Mattimax/DATA-AI_Conversation_ITA, dataset learning</article-title>
          ,
          <year>2021</year>
          , pp.
          <fpage>624</fpage>
          -
          <lpage>646</lpage>
          . of over
          <volume>10</volume>
          ,
          <article-title>000 prompt-response pairs in Italian,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <article-title>released by M.INC for training language models</article-title>
          . S. Borgeaud,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          , mT5:
          <string-name>
            <surname>A massively P. Liang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Fedus</surname>
          </string-name>
          ,
          <article-title>Emergent Abilities of multilingual pre-trained text-to-text transformer</article-title>
          ,
          <source>Large Language Models, Transactions on Machine in: Proceedings of the 2021 Conference of the North Learning Research</source>
          (
          <year>2022</year>
          ).
          <article-title>American Chapter of the Association for Computa-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , tional Linguistics:
          <article-title>Human Language Technologies</article-title>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>GLUE:</surname>
          </string-name>
          <article-title>A multi-task benchmark and 2021</article-title>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          .
          <article-title>analysis platform for natural language understand-</article-title>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          , S. Conia, ing,
          <source>in: Proceedings of the 2018 EMNLP Workshop E</source>
          . Barba,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni, R. Navigli, MinBlackboxNLP: Analyzing and
          <article-title>Interpreting Neural erva llms: The first family of large language models Networks for</article-title>
          NLP,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          .
          <article-title>trained from scratch on italian data</article-title>
          ,
          <source>in: Proceedings</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pruksachatkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <source>of the Tenth Italian Conference on Computational J</source>
          .
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
          </string-name>
          , Superglue: Linguistics (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          .
          <article-title>A stickier benchmark for general-purpose language</article-title>
          [27]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Galatolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Cimino</surname>
          </string-name>
          , Cerbero-7B:
          <article-title>A Leap understanding systems</article-title>
          ,
          <source>Advances in neural infor- Forward in Language-Specific LLMs Through Enmation processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
          <article-title>hanced Chat Corpus Generation and Evaluation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohananey</surname>
          </string-name>
          , arXiv preprint arXiv:
          <volume>2311</volume>
          .15698 (
          <year>2023</year>
          ). W. Peng,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , BLiMP: The [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Gul</surname>
          </string-name>
          , Benchmark of Linguistic Minimal Pairs for English,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jumelet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <article-title>Transactions of the Association for Computational C</article-title>
          .
          <string-name>
            <surname>Ross</surname>
          </string-name>
          , et al.,
          <source>BabyLM Turns 3: Call for papers Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>377</fpage>
          -
          <lpage>392</lpage>
          . for the 2025 BabyLM workshop, arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Evanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lakretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>King</surname>
          </string-name>
          , Language ac- arXiv:
          <fpage>2502</fpage>
          .10645 (
          <year>2025</year>
          ).
          <article-title>quisition: do children and language models follow similar learning stages?</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>12205</fpage>
          -
          <lpage>12218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Serino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          , Disce aut Deficere:
          <article-title>Evaluating LLMs Proficiency on the INVALSI Italian Benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2406.17535</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          , E. Rodolà,
          <article-title>Camoscio: an Italian Instruction-tuned LLaMA</article-title>
          ,
          <source>in: Proceedings of the Ninth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2023</year>
          ),
          <year>2023</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          , I. Gulrajani,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dubois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , T. B.
          <string-name>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Alpaca: A strong, replicable instruction-following model</article-title>
          ,
          <source>Stanford Center for Research on Foundation Models</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          <article-title>7</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kordi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Self-instruct: Aligning language models with self-generated instructions</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>13484</fpage>
          -
          <lpage>13508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          , M. Chen,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>