<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BAMBI Goes to School: Evaluating Italian BabyLMs with Invalsi-ITA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Capone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Suozzi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca E. Lebani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa</institution>
          ,
          <addr-line>Via Santa Maria 36, 56126 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>European Centre for Living Technology (ECLT)</institution>
          ,
          <addr-line>Ca' Bottacin, Dorsoduro 3911, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>QuaCLing Lab, Dipartimento di Studi Linguistici e Culturali Comparati, Università Ca' Foscari Venezia</institution>
          ,
          <addr-line>Dorsoduro 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper explores the impact of ecologically and cognitively plausible data on the training of language models. It builds on prior work [1, 2] integrating child-directed speech, curriculum learning and instruction tuning to train Italian BabyLMs. To evaluate our BabyLMs, we compare the performance of these models, trained on fewer than 100M words using various techniques, with that of native Italian Large Language Models using the Invalsi-ITA [3] benchmark, designed to evaluate Italian students on text comprehension and linguistic abilities. The goal is to assess whether cognitively motivated training approaches (curriculum learning based on child-directed speech and child-friendly data), which are crucial for a meaningful comparison between human learners and computational systems [4], yield greater efficiency than standard methods.</p>
      </abstract>
      <kwd-group>
<kwd>Italian BabyLM</kwd>
        <kwd>Invalsi-ITA benchmark</kwd>
        <kwd>LM Evaluation</kwd>
        <kwd>Text Comprehension</kwd>
        <kwd>Italian Grammar</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Even though Language Models (LMs) have taken research in linguistics and cognitive science by storm, their meaningful application in these fields still faces significant challenges. For LMs to be useful and informative for understanding language and cognition, several plausibility criteria must be met [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. Among the most important are the amount of input received during training and the number of trainable parameters. A growing body of empirical evidence shows that beyond a certain model size and amount of training data, the probability distributions generated by LMs diverge from human-like patterns and become poor predictors of psycholinguistic measures, such as eye-tracking data [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In contrast, smaller models trained on a limited amount of data appear to align more closely with human reading strategies. This observation is consistent with findings from the BabyLM Challenge, which demonstrate that models trained on child-directed speech and capped at 100 million words can achieve strong syntactic competence [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. In addition to model size and training data volume, other plausibility criteria should be considered. These include the quality of the input (such as child-directed speech) and the manner in which it is presented, for instance through Curriculum Learning (CL). Moreover, the standard language modeling objective differs substantially from the discursive and interactive exchanges children engage in with adults and peers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In short, approximating child language learning conditions requires attention to multiple dimensions.
      </p>
      <p>This study investigates the impact of such dimensions on LMs' development of linguistic skills. Specifically, we examine the effectiveness of training Italian BabyLMs using child-directed speech, curriculum learning, and instruction tuning, techniques inspired by human language acquisition, with the aim of assessing whether these cognitively grounded methods lead to improved performance compared to conventional training approaches, particularly when working with limited data. To this end, we evaluate our BabyLMs against native Italian Large Language Models using the Invalsi-ITA benchmark, which focuses on text comprehension and linguistic knowledge.</p>
      <p>The paper is structured as follows: first, an overview of related works is provided in Section 2. Section 3 is dedicated to the description of the models' evaluation. The models are presented in Section 3.1, whilst Sections 3.2 and 3.3 describe the Invalsi-ITA benchmark, used for the evaluation, and the procedure followed to assess the models' abilities. The results of the evaluation are detailed in Section 3.4 and discussed in Section 3.5. Finally, some conclusions are drawn in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works 3. Evaluating Text Comprehension and Grammatical Knowledge with Invalsi-ITA</title>
      <p>Two lines of research are particularly relevant to our
goals, as they represent two sides of the same coin: the
ifrst focuses on the quality and quantity of training data
necessary for BabyLMs to develop linguistic abilities; the 3.1. Models
second concerns the evaluation of BabyLMs through the
creation or adaptation of benchmarks originally designed The Bambi model is based on a lightweight GPT-2-style
to assess the linguistic competence of human speakers. decoder architecture, with approximately 136 million
pa</p>
      <p>
        Regarding the first aspect, several studies have ex- rameters (Table 1). It is trained on a dataset composed of
plored training models on datasets that are compara- transcripts of child-directed speech and multimedia
ble—both in size and in linguistic nature—to the input typ- content designed for children [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. So far, the dataset
ically received by children during early development (e.g., is organized into three tiers of increasing linguistic
com[
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]). These works show that while a large vol- plexity, corresponding to the age ranges 0–6, 6–12, and
ume of data is essential for achieving strong performance 12–18. An additional tier is currently in progress. For the
on standard Natural Language Understanding tasks, a Bambi baseline model, all three tiers are used in a fully
significantly smaller amount is suficient for acquiring shufled format. In contrast, the Bambi_CL (Curriculum
core syntactic knowledge. In addition to data quantity Learning) model is trained on the tiers sequentially,
proand quality, the importance of curriculum learning strate- gressing from the simplest to the most complex. Based
gies and model architecture optimization has also been on both the base and CL models, Instruction Tuning
highlighted [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. (IT) variants are implemented (Table 2). The IT training
      </p>
      <p>
        On the evaluation front, several benchmarks have been dataset comprises the following resources:
developed over the years (e.g., [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]). While these • teelinsan/camoscio_cleaned : a translated
benchmarks are efective tools for comparing models version [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] of the Stanford Alpaca dataset
against each other, they are not well-suited for comparing [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which consists of LM-generated
instructionmodels to human language abilities, especially those of response pairs based on a seed set of
humanchildren. Although some studies have directly addressed written prompts [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The dataset contains
apthis gap (e.g., [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), they have not yet produced large- proximately 50,000 items.
scale, standardized benchmarks for this purpose. • massimilianowosz/gsm8k-it : a translated
      </p>
      <p>
        For the Italian language, to the best of our knowledge, version of GSM8K [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], a dataset of 8.500 grade
only two benchmarks currently enable both model-to- school-level math word problems.
model and model-to-human comparisons. The first is
BaBIEs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a benchmark derived from the adaptation of •
dMaattatseitmoafx/ItDaAliTanA--lAanIg_uCaognevceornsvaetrisaotnio_nIsT,Aco:mafour standardized tests originally designed to assess the prising 10,000 items [24].
semantic and syntactic competence of Italian-speaking
children. The second is Invalsi-ITA [
        <xref ref-type="bibr" rid="ref19 ref3">3, 19</xref>
        ], described in
Section 3.2, which aims to evaluate text comprehension
and linguistic abilities in Italian students from primary
through high school.
      </p>
      <p>In this study, we employ the Invalsi-ITA benchmark
to evaluate various Bambi models, a series of Italian
BabyLMs which difer from one another in terms of i.)
the amount of training data, ii.) the type of training
data and learning strategies adopted, and iii.) instruction
tuning (cf. Section 3.1). This benchmark is particularly
well-suited to our analysis, as it allows us to observe
improvements or declines across school grades and to
isolate which of the above three variables may be
influencing such trends in performance.</p>
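        <p>To make the two training regimes concrete, the sketch below contrasts the fully shuffled baseline stream with the tier-by-tier curriculum stream. It is a minimal Python illustration with toy data, not the released training code; function names and example sentences are ours.</p>
        <preformat>
import random

# Toy tier corpora for the three age ranges; in the actual setting
# each list would contain tokenized documents.
tiers = {
    "0-6":   ["mamma guarda il gatto", "il cane corre"],
    "6-12":  ["la maestra spiega la lezione di storia"],
    "12-18": ["il saggio discute le cause della rivoluzione"],
}

def baseline_stream(tiers, seed=0):
    """Bambi baseline: all tiers merged and fully shuffled."""
    docs = [doc for tier in tiers.values() for doc in tier]
    random.Random(seed).shuffle(docs)
    return docs

def curriculum_stream(tiers):
    """Bambi_CL: tiers presented sequentially, simplest first."""
    return [doc for name in ("0-6", "6-12", "12-18") for doc in tiers[name]]

print(baseline_stream(tiers))
print(curriculum_stream(tiers))
        </preformat>
        <p>In both regimes the model sees the same documents; only their order differs, which is precisely the variable the CL comparison isolates.</p>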
      <sec id="sec-2-1">
        <title>For comparison purposes, the same architecture was</title>
        <p>trained on a traditional dataset of equivalent size, using
a random subset of mC4 [25], a corpus derived from
the public Common Crawl web scrape and used to train
standard LMs.</p>
        <p>It is important to note that BabyLMs typically operate
with limited input and output context windows, both
to maintain model compactness and to respect cognitive
plausibility constraints. In particular, the training data for
the first and second developmental tiers avoid excessively
long sequences. However, to enable evaluation on the
Invalsi-ITA benchmark, the models were trained with a
context window of 6,144 tokens, the minimum required to
avoid truncating benchmark items. Crucially, our dataset
remains untouched. The BabyLMs are compared against
ifve other models (Tables 1 and 2). Minerva-3B is the
model trained on the least amount of data, despite not
being the smallest in size. It is followed by Minerva-7B and
Minerva 7B-it, which rank second in terms of data
volume [26]. Next is Velvet-2B, trained on approximately 3</p>
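        <p>The 6,144-token figure can be obtained by scanning the benchmark for its longest item. A minimal sketch under that assumption, with items available as plain strings; the whitespace tokenizer is only a stand-in for the models' subword tokenizers, whose counts run higher:</p>
        <preformat>
def required_context(items, encode):
    """Smallest context window that avoids truncating any item;
    `encode` maps a string to a list of tokens."""
    return max(len(encode(item)) for item in items)

# Toy usage with a whitespace "tokenizer".
items = ["Read the text and answer the question: ...",
         "Indica in quale frase la parola 'pietra' è usata in senso figurato."]
print(required_context(items, str.split))
        </preformat>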
        <sec id="sec-2-1-1">
          <title>Architecture</title>
          <p>30,000
32,768
51,200
126,976
32,000
12x12
32x32
32x32
28x32
32x32</p>
          <p>768
2,560
4,096
2,048
4,096</p>
          <p>
            135,856,128
2,894,236,160
7,399,018,496
2,223,097,856
7,241,732,096
trillion tokens 1, and finally Cerbero-7B, for which the Invalsi-ITA focuses on the Italian language. It
origamount of training data has not been disclosed by the inally included 1,264 questions, classified by [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] into:
developers [27]. These models were chosen because their i.) multiple choice; ii.) binary (e.g., TRUE/FALSE); iii.)
training corpora are predominantly in Italian. open-ended; iv.) other. The authors of the benchmark
excluded categories (iii.) and (iv.) retaining only multiple
3.2. Invalsi-ITA choice (87.47%) and binary (14.33%) questions, for a total
of 1,117 questions. The benchmark assesses two main
Invalsi-ITA [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] is a benchmark derived from the adap- kinds of competence: text comprehension and
linguistation of an established battery of assessments aimed at tic knowledge. Text comprehension items (930/1,117,
gauging educational proficiency throughout Italy. 83.26% of the total) require students to read a text and
an
          </p>
          <p>The INVALSI (Istituto nazionale per la valutazione del swer related questions (e.g., Le prime tre righe del racconto
sistema educativo di istruzione e di formazione ‘National parlano della vita di Polipetto nel suo ambiente. Quale
Institute for the Evaluation of the Education and Training frase spiega in poche parole come viveva Polipetto? ‘The
System’) tests have been administered to Italian students ifrst three lines of the story talk about Polipetto’s life
since the 2005/2006 school year. These tests are designed in his environment. Which sentence briefly explains
to monitor the students’ competence of Italian language how Polipetto lived?’), while language items (187/1,117,
and Mathematics throughout their educational path. In- 16.74% of the total) assess knowledge of specific
gramcreasingly complex tests are administered during primary matical rules (e.g., Indica in quale frase la parola “pietra”
school (grades 2 and 5), middle school (grades 6 and 8) è usata in senso figurato, cioè non indica la pietra vera e
and high school (grades 10 and 13). propria. ‘Indicate in which sentence the word “stone” is
used figuratively, that is, it does not refer to an actual
1https://huggingface.co/Almawave/Velvet-2B stone.’).</p>
        </sec>
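        <p>The item shares reported above follow directly from the counts and can be verified in a few lines:</p>
        <preformat>
total, comprehension, language = 1117, 930, 187
assert comprehension + language == total
print(f"comprehension: {comprehension / total:.2%}")  # 83.26%
print(f"language: {language / total:.2%}")            # 16.74%
        </preformat>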
        <sec id="sec-2-1-2">
          <title>Question Macro-Area Grade 2 Grade 5 Grade 6</title>
          <p>Comprehension
Semantics
Syntax
Morphology
Phonology
Pragmatics/Textuality
Punctuation/Spelling
Total
149
1
0
0
0
0
1
151</p>
          <p>Figure 1 shows the accuracy obtained by all models in
each grade, considering both the text comprehension and
the linguistic items. The accuracy values for each model
in each grade are reported in Table 4 (Appendix 4).</p>
          <p>A similar accuracy pattern emerges across grades 2
3.3. Method to 10 (Figure 1,). Cerbero-7B consistently achieves the
The items are presented to the models in a zero-shot highest accuracy, although its performance gradually
setting. Each item consists of a text (when present), a declines over the grades. Minerva-7B and
Minerva-7Bquestion that includes the list of multiple-choice options, it follow with slightly lower scores, showing peaks in
and the answer, often represented only by the letter cor- grades 2 and 6, a pattern also observed in Velvet-2B. In
responding to the correct choice. Prompts and expected contrast, Minerva-3B aligns more closely with the Bambi
outputs are formatted using the following template (orig- models, which display the lowest accuracy throughout
inally in Italian; a translation is provided here for clarity). these grades.</p>
          <p>A diferent pattern emerges in grade 13: Bambi,
Prompt: Bambi_it, and Bambi_mc4_it achieve the highest
accuracy, alongside Velvet-2B. Slightly lower scores are
Read the text and answer the question: obtained by the Minerva models, with Minerva-7B-it
{text} still leading this group. Notably, Cerbero-7B’s
perfor{question} mance drops significantly in this final grade. Focusing
Completions: on the Bambi family, the strongest performances are
overall exhibited by Bambi, Bambi_it, Bambi_CL_it, and
• La risposta corretta è A: {answer_a} Bambi_mc4_it.
• La risposta corretta è B: {answer_b} Let us now turn to the accuracy the models achieved in
the text comprehension items, displayed in Figure 2. The
accuracy values are reported in Table 5 (Appendix A). The</p>
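        <p>The paper does not spell out how a model's choice is extracted from this template; a common zero-shot protocol, sketched below under that assumption, ranks the candidate completions by the likelihood the model assigns to them given the prompt. Function names and the toy scorer are ours.</p>
        <preformat>
def format_item(text, question):
    """Zero-shot prompt following the template above
    (English gloss; the original is in Italian)."""
    return f"Read the text and answer the question:\n{text}\n{question}"

def candidate_completions(options):
    """One candidate per option letter, mirroring
    'La risposta corretta è A: {answer_a}', etc."""
    return {letter: f"La risposta corretta è {letter}: {answer}"
            for letter, answer in options.items()}

def pick_answer(prompt, completions, loglik):
    """Return the letter whose completion the model scores highest;
    `loglik` is any callable giving log P(completion | prompt)."""
    return max(completions, key=lambda k: loglik(prompt, completions[k]))

# Toy scorer standing in for a real LM (prefers shorter completions),
# used only to make the example runnable end to end.
toy_loglik = lambda prompt, completion: -len(completion)
options = {"A": "viveva tranquillo nel suo scoglio",
           "B": "era costretto a fuggire continuamente dai predatori"}
prompt = format_item("...", "Quale frase spiega come viveva Polipetto?")
print(pick_answer(prompt, candidate_completions(options), toy_loglik))  # "A"
        </preformat>
        <p>Accuracy is then the share of items whose selected letter matches the gold letter.</p>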
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>Figure 1 shows the accuracy obtained by all models in each grade, considering both the text comprehension and the linguistic items. The accuracy values for each model in each grade are reported in Table 4 (Appendix A).</p>
        <p>A similar accuracy pattern emerges across grades 2 to 10 (Figure 1). Cerbero-7B consistently achieves the highest accuracy, although its performance gradually declines over the grades. Minerva-7B and Minerva-7B-it follow with slightly lower scores, showing peaks in grades 2 and 6, a pattern also observed in Velvet-2B. In contrast, Minerva-3B aligns more closely with the Bambi models, which display the lowest accuracy throughout these grades.</p>
        <p>A different pattern emerges in grade 13: Bambi, Bambi_it, and Bambi_mc4_it achieve the highest accuracy, alongside Velvet-2B. Slightly lower scores are obtained by the Minerva models, with Minerva-7B-it still leading this group. Notably, Cerbero-7B's performance drops significantly in this final grade. Focusing on the Bambi family, the strongest performances are overall exhibited by Bambi, Bambi_it, Bambi_CL_it, and Bambi_mc4_it.</p>
        <p>Let us now turn to the accuracy the models achieved on the text comprehension items, displayed in Figure 2. The accuracy values are reported in Table 5 (Appendix A). The figure shows that the accuracy values and patterns observed for the comprehension items largely reflect those found in the overall analysis. Cerbero-7B consistently achieves the highest accuracy across grades 2 to 10 (with all values above 0.50, though gradually declining), while a marked drop is observed in grade 13. Across grades 2 to 10, the Minerva models attain the second-highest accuracy, with Minerva-7B-it performing best within the family, closely followed by Minerva-7B. As in the overall analysis, the Bambi models perform poorly from grades 2 to 10 but improve significantly in grade 13: Bambi, Bambi_it, and Bambi_mc4_it all exceed 0.50 accuracy in this grade. The same pattern is observed for Velvet-2B.</p>
        <p>A different trend is observed when considering only the accuracy achieved on the language items, displayed in Figure 3. The accuracy values are reported in Table 6 (Appendix A). Cerbero-7B, Velvet-2B, and Minerva-3B perform overall worse on items specifically targeting grammatical knowledge than they do on text comprehension items. Minerva-7B and Minerva-7B-it, on the contrary, achieve similar accuracies in both tasks, and perform better in this task in grades 2 and 6. As for the Bambi models, they differ from each other in the accuracy they achieve. In grade 2, only Bambi, Bambi_mc4, and Bambi_mc4_it achieve their highest accuracy of all grades (0.50), whereas the others do not provide any correct answer in this grade. In grade 5 the same three Bambi models perform slightly better than Minerva-3B and Velvet-2B. In grade 6 Bambi_CL and Bambi_CL_it reach a peak in accuracy exceeding 0.50, followed by Bambi_mc4_it. Overall, grades 2 and 6 appear to be easier for some models, but challenging for others. Grade 13 is challenging for all models, as none of them provide a correct response.</p>
        <p>Finally, let us consider the accuracy achieved by the models on the two kinds of questions that compose the Invalsi-ITA benchmark, i.e., multiple choice and binary (a summary of the accuracy values for binary and multiple choice questions is reported in Table 7 in Appendix A). The accuracies for the binary questions are displayed in Figure 4. For binary questions, accuracy generally hovers around or slightly above the expected chance level (0.5). Most models tend to perform better at the lower (grade 2) and upper (grade 13) ends of the evaluation spectrum, with a noticeable dip in performance across intermediate grades (5-10). Among the best-performing models, Bambi_CL_it and Cerbero-7B achieve the highest accuracy at grade 2 (0.70 and 0.65, respectively). Minerva-7B-it and Cerbero-7B show relatively stable performance across grade levels, with only minor fluctuations. Notably, Bambi_CL_it performs comparably to larger models.</p>
        <p>Multiple choice questions (Figure 5) appear to be more challenging for all models. Given the four-alternative format, chance accuracy is approximately 0.25, and most models perform only marginally above this baseline. Still, some models demonstrate steady improvement across grade levels, particularly Velvet-2B and Cerbero-7B. The latter stands out as the most consistent and accurate performer in this task, achieving scores in the range 0.53 to 0.56 across several grades and peaking at 0.625 in grade 13. Bambi models, on the contrary, seem to find this kind of question more challenging, particularly in grades 2 to 10. However, Bambi, Bambi_CL_it, and Bambi_mc4 exceed chance level in various grades. In particular, the performance of Bambi, Bambi_it, and Bambi_mc4 peaks at grade 13, reaching an accuracy of around 0.40.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Discussion</title>
        <p>The Invalsi-ITA benchmark appears to be challenging for all the models under investigation, as none of them exceeds an accuracy value of 0.60. It should be kept in mind, however, that Invalsi tests are also challenging for Italian students [<xref ref-type="bibr" rid="ref3">3</xref>] (the benchmark unfortunately does not provide student-level data, but the paper describing the original resource [<xref ref-type="bibr" rid="ref3">3</xref>] includes a bar plot illustrating the performance gap, which highlights the challenges faced by Italian students).</p>
        <p>The larger models, i.e., Cerbero-7B, Minerva-7B and Minerva-7B-it, perform overall better in this benchmark, especially when they are instruction-tuned. The reason may lie in the nature of Invalsi-ITA. This benchmark consists of text comprehension items and language items, which specifically address normative grammatical rules, rather than the models' linguistic competence tout court. Naturally, models which are exposed to a larger amount of training data and, even more importantly, to a large amount of written data, may be facilitated in these kinds of tasks, either because they have been exposed to the actual texts used in the benchmark, or because they are more used to this kind of linguistic input.</p>
        <p>Nonetheless, the Bambi models exhibit a great improvement in grade 13 on the text comprehension items, and some of them perform comparably to larger models on the language items (e.g., in grades 2 and 6). These results suggest that compact models, despite lacking comprehensive world knowledge, can develop robust grammatical knowledge at early stages of training. Furthermore, considering binary questions, most of them, particularly Bambi_CL_it, Bambi_mc4 and Bambi_mc4_it, perform comparably to larger models in specific grades despite their compact size and training constraints, suggesting the potential benefits of a combination of oral and written training data.</p>
        <p>Turning to curriculum learning and instruction tuning, a closer examination of the different Bambi models indicates that each strategy contributes modest gains, particularly in early grades. However, models that combine both strategies, such as Bambi_CL_it, show more consistent improvements, especially compared to IT-only variants. This is particularly evident in the case of the language items. The pattern implies that CL may enhance a model's capacity for subsequent learning, making IT more effective. This finding aligns with insights from human developmental learning, where structured progression lays the groundwork for improved adaptability and generalization over time (we acknowledge the importance of cross-linguistic validation; to this end, we have submitted a related study to the third BabyLM Challenge [<xref ref-type="bibr" rid="ref28">28</xref>], currently under review, and preliminary results on English show a similar trend).</p>
        <p>These results give rise to some puzzling observations that merit closer examination. For instance, when comparing the Bambi models with their mc4-trained counterparts, substantial differences appear only in grades 2 (although this grade includes only two items) and 6 of the language items. This prompts the question of whether using ecologically plausible data is as crucial as often assumed, or whether standard training corpora, such as mc4, can produce comparable results. In fact, the Bambi_mc4 models perform comparably to other Bambi models in many settings, indicating that the choice of data alone does not yield a substantial difference. However, they do not clearly outperform the Bambi models either: they achieve their best relative result in grade 5 of the language items, but in all other grades and tasks they perform worse than, or at best match, at least one of the Bambi variants. This pattern suggests that while web training data can approximate the results of carefully curated child-directed speech to some extent, it does not consistently provide an advantage, highlighting the need for a deeper analysis of the interactions between data quality, structure, and curriculum learning.</p>
        <p>Another notable result is the unexpected jump in performance of the Bambi_CL models in grade 6 on the language items. One possible explanation lies in the CL strategy: although the total number of tokens processed by these models over multiple epochs approaches the lifetime exposure of an 18-year-old adolescent, the absolute size of the Bambi dataset more closely reflects the typical linguistic input of a child aged six to eight. This alignment may account for the relatively strong results in grade 6, which corresponds to the final portion of the training curriculum. However, this interpretation does not readily explain another surprising outcome: in the text comprehension task for grade 13, the Bambi and Bambi_mc4 models outperform not only Bambi_CL and Bambi_CL_it, but also larger models like Minerva and Cerbero-7B. This could be an artifact of the limited number of items in this grade, but it highlights an area where further investigation is warranted to understand how data composition, curriculum pacing, and task type interact in shaping model behavior.</p>
        <p>Taken together, these findings highlight several key
insights. First, larger model size alone does not guarantee
superior performance: smaller models can be competitive
in specific cases, particularly in structurally simpler tasks.</p>
        <p>Second, training strategies such as CL and IT apparently yield effective improvements only under specific evaluation conditions. Finally, the performance gap between BabyLMs and LLMs remains substantial, particularly in tasks requiring deep semantic understanding or world knowledge. Closing this gap without compromising cognitive and linguistic plausibility remains a key challenge.</p>
        <p>Future work will need to explore new training strategies and evaluation frameworks to address it.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <sec id="sec-3-1">
        <title>In this work, we presented an evaluation of six Bambi model variants alongside five larger models, using the Invalsi-ITA benchmark, which assesses text comprehension and linguistic abilities.</title>
      <p>This evaluation revealed that larger models have an advantage in the text comprehension task, either because they have already encountered the texts used in the benchmark or because they are more accustomed to this kind of linguistic input. Nonetheless, smaller but more cognitively plausible models appear to be advantaged in learning and generalization, as highlighted by their improvement in higher grades on both text comprehension and language items.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>We acknowledge financial support under the PRIN</title>
        <p>2022 Project Title "Computational and linguistic
benchmarks for the study of verb argument structure" – CUP
I53D23004050006 - Grant Assignment Decree No. 1016,
07/07/2023 by the Italian Ministry of University and
Research (MUR), funded by the European Commission
under the NextGeneration EU programme. This research
was also partly funded by PNRR—M4C2—Investimento
1.3, Partenariato Esteso PE00000013—“FAIR—Future
Artificial Intelligence Research”—Spoke 1 “Human-centered
AI,” funded by the European Commission under the
NextGeneration EU programme.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix A: Accuracy Values for Invalsi-ITA</title>
      <p>Declaration on Generative AI</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Capone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Suozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          , et al.,
          <article-title>BaBIEs: A Benchmark for the Linguistic Evaluation of Italian Baby Language Models</article-title>
          ,
          <source>in: Proceedings of the Tenth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Suozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Capone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          , BAMBI:
          <article-title>Developing BAby language Models for Italian</article-title>
          ,
          <source>Lingue e linguaggio, Rivista semestrale</source>
          (
          <year>2025</year>
          )
          <fpage>83</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cassese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <article-title>The invalsi benchmarks: measuring the linguistic and mathematical understanding of large language models in italian</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Computational Linguistics</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>6782</fpage>
          -
          <lpage>6797</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          , T. Linzen,
          <article-title>Bigger is not always better: The importance of human-scale language modeling for psycholinguistics</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>144</volume>
          (
          <year>2025</year>
          )
          <fpage>104650</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What artificial neural networks can tell us about human language acquisition, in: Algebraic structures in natural language</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>Understanding natural language understanding systems</article-title>
          ,
          <source>Sistemi intelligenti 35</source>
          (
          <year>2023</year>
          )
          <fpage>277</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Connell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lynott</surname>
          </string-name>
          ,
          <article-title>What can language models tell us about human cognition?</article-title>
          ,
          <source>Current Directions in Psychological Science</source>
          <volume>33</volume>
          (
          <year>2024</year>
          )
          <fpage>181</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.-D.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Schuler</surname>
          </string-name>
          ,
          <article-title>Why does surprisal from larger transformer-based language models provide a poorer fit to human reading times?</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>336</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>A. De Varda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Marelli</surname>
          </string-name>
          ,
          <article-title>Scaling in cognitive modelling: A multilingual approach to human reading times, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ciro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mosquera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Paranjabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , et al.,
          <article-title>Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora</article-title>
          ,
          <source>in: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <article-title>Findings of the second BabyLM challenge: Sample-efficient pretraining on developmentally plausible corpora</article-title>
          ,
          <source>in: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>When Do You Need Billions of Words of Pretraining Data?</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Nat- H</source>
          . Jun,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          , J. Hilton,
          <source>ural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          , Training veri2021, pp.
          <fpage>1112</fpage>
          -
          <lpage>1125</lpage>
          . ifers to solve math word problems, arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Huebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cynthia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          , Baby- arXiv:
          <fpage>2110</fpage>
          .14168 (
          <year>2021</year>
          ).
          <article-title>BERTa: Learning more grammar with small-scale [24] Mattimax, Italian conversations dataset by m.inc, child-directed language</article-title>
          ,
          <source>in: Proceedings of the 2025</source>
          . URL: https://huggingface.co/datasets/ 25th conference
          <article-title>on computational natural language Mattimax/DATA-AI_Conversation_ITA, dataset learning</article-title>
          ,
          <year>2021</year>
          , pp.
          <fpage>624</fpage>
          -
          <lpage>646</lpage>
          . of over
          <volume>10</volume>
          ,
          <article-title>000 prompt-response pairs in Italian,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <article-title>released by M.INC for training language models</article-title>
          . S. Borgeaud,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          , mT5:
          <string-name>
            <surname>A massively P. Liang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Fedus</surname>
          </string-name>
          ,
          <article-title>Emergent Abilities of multilingual pre-trained text-to-text transformer</article-title>
          ,
          <source>Large Language Models, Transactions on Machine in: Proceedings of the 2021 Conference of the North Learning Research</source>
          (
          <year>2022</year>
          ).
          <article-title>American Chapter of the Association for Computa-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , tional Linguistics:
          <article-title>Human Language Technologies</article-title>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>GLUE:</surname>
          </string-name>
          <article-title>A multi-task benchmark and 2021</article-title>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          .
          <article-title>analysis platform for natural language understand-</article-title>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          , S. Conia, ing,
          <source>in: Proceedings of the 2018 EMNLP Workshop E</source>
          . Barba,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni, R. Navigli, MinBlackboxNLP: Analyzing and
          <article-title>Interpreting Neural erva llms: The first family of large language models Networks for</article-title>
          NLP,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          .
          <article-title>trained from scratch on italian data</article-title>
          ,
          <source>in: Proceedings</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pruksachatkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <source>of the Tenth Italian Conference on Computational J</source>
          .
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
          </string-name>
          , Superglue: Linguistics (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          .
          <article-title>A stickier benchmark for general-purpose language</article-title>
          [27]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Galatolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Cimino</surname>
          </string-name>
          , Cerbero-7B:
          <article-title>A Leap understanding systems</article-title>
          ,
          <source>Advances in neural infor- Forward in Language-Specific LLMs Through Enmation processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
          <article-title>hanced Chat Corpus Generation and Evaluation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohananey</surname>
          </string-name>
          , arXiv preprint arXiv:
          <volume>2311</volume>
          .15698 (
          <year>2023</year>
          ). W. Peng,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , BLiMP: The [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Gul</surname>
          </string-name>
          , Benchmark of Linguistic Minimal Pairs for English,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jumelet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <article-title>Transactions of the Association for Computational C</article-title>
          .
          <string-name>
            <surname>Ross</surname>
          </string-name>
          , et al.,
          <source>BabyLM Turns 3: Call for papers Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>377</fpage>
          -
          <lpage>392</lpage>
          . for the 2025 BabyLM workshop, arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Evanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lakretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>King</surname>
          </string-name>
          , Language ac- arXiv:
          <fpage>2502</fpage>
          .10645 (
          <year>2025</year>
          ).
          <article-title>quisition: do children and language models follow similar learning stages?</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>12205</fpage>
          -
          <lpage>12218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Serino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          , Disce aut Deficere:
          <article-title>Evaluating LLMs Proficiency on the INVALSI Italian Benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2406.17535</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          , E. Rodolà,
          <article-title>Camoscio: an Italian Instruction-tuned LLaMA</article-title>
          ,
          <source>in: Proceedings of the Ninth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2023</year>
          ),
          <year>2023</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          , I. Gulrajani,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dubois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , T. B.
          <string-name>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Alpaca: A strong, replicable instruction-following model</article-title>
          ,
          <source>Stanford Center for Research on Foundation Models</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          <article-title>7</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kordi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Self-instruct: Aligning language models with self-generated instructions</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>13484</fpage>
          -
          <lpage>13508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          , M. Chen,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>