<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BaBIEs: A Benchmark for the Linguistic Evaluation of Italian Baby Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Capone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Suozzi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca E. Lebani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa</institution>
          ,
          <addr-line>Via Santa Maria 36, 56126 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>European Centre for Living Technology (ECLT)</institution>
          ,
          <addr-line>Ca' Bottacin, Dorsoduro 3911, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>QuaCLing Lab, Dipartimento di Studi Linguistici e Culturali Comparati, Università Ca' Foscari Venezia</institution>
          ,
          <addr-line>Dorsoduro 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The possibility of comparing the linguistic competence of Language Models (LMs) to that of children has gained growing attention lately, raising the need for effective tools for evaluating both the former and the latter. To this purpose, we developed a resource for the linguistic evaluation of BabyLMs, which are LMs trained on datasets comparable to the linguistic stimulus received by children. This resource adapts four standardized tests for the evaluation of the linguistic skills of Italian-speaking children (BVL, TROG-2, TCGB-2 and Peabody). To verify the effectiveness of our benchmark, we administered it to Minerva, an LLM pretrained from scratch on Italian. Our results indicate that Minerva struggles to master certain linguistic aspects, achieving an age-equivalent score of 4 years, and that the type of task administered affects the model's performance.</p>
      </abstract>
      <kwd-group>
<kwd>Language Models</kwd>
        <kwd>Linguistic Evaluation</kwd>
        <kwd>Benchmark</kwd>
        <kwd>BabyLMs</kwd>
        <kwd>Language Acquisition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>… the light of the experiments in Section 5. Finally, in Section 6, some conclusions and possible future research directions are outlined.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author. † For the specific purposes of the Italian Academy, Luca Capone is responsible for Sections 2 and 4.1, Alice Suozzi and Luca Capone for Section 3, Alice Suozzi for Sections 4.2 and 5, Alessandro Lenci and Gianluca E. Lebani for Sections 1 and 6. Emails: luca.capone@fileli.unipi.it (L. Capone); alice.suozzi@unive.it (A. Suozzi); gianluca.lebani@unive.it (G. E. Lebani); alessandro.lenci@unipi.it (A. Lenci)</p>
<p>ORCID: 0000-0002-1872-6956 (L. Capone); 0000-0002-5215-7742 (A. Suozzi); 0000-0002-3588-1077 (G. E. Lebani); 0000-0001-5790-4308 (A. Lenci). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>… 10 million words per year on average, reaching around 100 million words by age 10. Zhang et al. [<xref ref-type="bibr" rid="ref4">4</xref>] demonstrate that substantial amounts of data are required to achieve good results in NLU tasks, such as those evaluated by SuperGLUE [<xref ref-type="bibr" rid="ref6">6</xref>]. Performance improvements become noticeable after surpassing the threshold of 1 billion words and continue to improve steadily even beyond 30 billion words. However, tasks focusing on language syntax (e.g., acceptability judgments and minimal pairs) exhibit the most significant improvements between 1 million and 100 million words, after which the learning curve plateaus. The authors conclude that, while acquiring factual knowledge necessitates large volumes of text, syntactic and semantic competence reaches saturation within the range of 10 million to 100 million words. Similar conclusions are reported by Wei et al. [<xref ref-type="bibr" rid="ref7">7</xref>], who investigate the emergent skills of various LLMs, confirming that the most sophisticated behaviors primarily arise from scaling up model training. These findings justify the focus on BabyLMs, which are LMs trained on limited amounts of data, qualitatively resembling the stimuli received by a preschooler. Huebner et al. [<xref ref-type="bibr" rid="ref8">8</xref>] illustrate this approach by training BabyBERTa on 50 million words of child-directed speech and simplified written text, achieving results comparable to RoBERTa-base on a grammar test suite. The BabyLM challenges [<xref ref-type="bibr" rid="ref9">9</xref>] fall within this line of research, aiming to optimize model training through curriculum learning (CL) techniques and architectural optimizations. This approach not only makes research more affordable, but also results in models that are more cognitively plausible with respect to human language acquisition. Although the proposed CL techniques did not lead to consistent improvements across all evaluation tasks [<xref ref-type="bibr" rid="ref9">9</xref>], it has been demonstrated that a model trained with limited data (10 million words) can achieve results comparable to those of large LMs on various benchmarks.</p>
      <sec id="sec-1-1">
        <title>2.2. Baby benchmarks for Baby models</title>
        <p>These results prompt a reconsideration of the comparability between LM training and human language learning. While benchmarks like BLiMP [<xref ref-type="bibr" rid="ref10">10</xref>] and GLUE [<xref ref-type="bibr" rid="ref11">11</xref>] facilitate comparisons between different models, they are not suitable for comparing BabyLMs to children who are acquiring a first language. Several studies attempt to address this shortcoming. For instance, Evanson et al. [<xref ref-type="bibr" rid="ref12">12</xref>] compare the learning order of certain syntactic structures in English between GPT-2 and preschoolers. They find that the model exhibits a consistent order in learning syntactic structures, which aligns with the one observed in preschoolers. Other tests that compare training in LMs to human language acquisition include the reading time test [<xref ref-type="bibr" rid="ref13">13</xref>] and the age-of-acquisition test [14].</p>
        <p>For the Italian language, the three main benchmarks are: (i) UINAUIL [15], which includes six NLU tasks selected from the EVALITA (Evaluation campaign for Language Technology in Italian) archive; (ii) IT5 [16], which focuses on summarization tasks; (iii) the Invalsi benchmark [17], which evaluates the mathematical and linguistic competences of LMs in Italian. Only the latter is relevant to our study, as it allows a comparison between human language learning (in the school-age range 6-18 years) and that of the models. However, the age range considered by Invalsi involves more sophisticated NLU tasks, rather than the fundamental linguistic abilities learned during the preschool period, within the 100 million word budget.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Nurturing BaBIEs</title>
        <p>In order to evaluate the linguistic abilities of BabyLMs, we developed BaBIEs by adapting four standardized tests designed to assess the linguistic competence of Italian-speaking children. These tests, which tap into different aspects of linguistic competence, are:</p>
        <p>• Batteria per la Valutazione del Linguaggio in bambini dai 4 ai 12 anni (BVL) 'Battery for the Assessment of Language in Children aged 4 to 12' [18]. BVL is designed to provide a global linguistic profile of Italian-speaking children and was standardized on a sample of 1,086 children aged 4 to 12. It consists of 18 tasks (e.g., semantic and phonological fluency, sentence and word comprehension, emotional prosody comprehension, etc.) grouped into three sections, i.e., production, comprehension, and repetition.</p>
        <p>• Peabody - Test di vocabolario recettivo (Italian adaptation of the Peabody Picture Vocabulary Test - Revised) [19, 20]. PPVT-R is intended to measure the receptive vocabulary of the subject and was standardized on a sample of 2,400 subjects aged 3 to 12 and 16. It consists of 175 items.</p>
        <p>• Test for Reception of Grammar - Version 2 (TROG-2) [21]. TROG-2 is designed to assess the comprehension of verbal language, especially syntactic structures, and was standardized on a sample of 1,276 subjects aged 4 to 87. It consists of 20 blocks, each containing four items that focus on a grammatical structure (e.g., zero anaphor, reversible in and on, relative clause in object, etc.).</p>
        <p>• Test di Comprensione Grammaticale per Bambini - Seconda Edizione (TCGB-2) 'Test of Grammatical Comprehension for Children - Second Edition' [22]. Analogously to TROG-2, TCGB-2 is a tool for assessing the comprehension of grammatical structures and was standardized on a sample of 455 children aged 4 to 11. It contains 74 items which measure the comprehension of six structures, i.e., the phenomenon of inflection and five types of sentences: locative, active, passive, relative and dative.</p>
        <p>It is worth noting that all tests are standardized on samples of typically-developing Italian-speaking subjects and are designed to be orally administered. That is, the stimuli are always read by the experimenter, and the child is asked either to answer orally or to point at a picture.</p>
        <p>BaBIEs consists of five tasks (see Table 4 in Appendix A): (i) Sentence Completion (the only task assessing linguistic production), (ii) Acceptability Judgment, (iii) Idiom Comprehension, (iv) Sentence Comprehension, (v) Lexical Comprehension. These tasks are taken from BVL. We added 165 out of 175 items from Peabody (Lexical Comprehension task) and all the items contained in TROG-2 and TCGB-2 (both Sentence Comprehension tasks).¹ Except for the Sentence Completion task and the Acceptability Judgment task, all of the others are similarly-structured comprehension tasks. The child is presented with an oral linguistic stimulus (i.e., a word, a sentence or an idiom) and with a set of three or four possible answers, from which the child must choose the answer corresponding to the linguistic stimulus (the target). Together, a stimulus and its set of possible answers constitute a test item. The key factor in the process of item adaptation from the original tests to BaBIEs was the modality in which the sets of possible answers are displayed.</p>
        <p>For the Acceptability Judgment task, we constructed minimal pairs of sentences by creating a grammatical or ungrammatical version of the verbal stimulus (depending on the (un)grammaticality of the original stimulus). In this task, the model receives one pair at a time. Its choice is determined by perplexity, with the sentence having the lowest perplexity score being chosen by the model.</p>
        <p>For the Sentence Completion and Idiom Comprehension tasks, as both the stimuli and the sets of possible answers are linguistic expressions, the adaptation process only involved reformatting them to be readable by the model. The Sentence Completion task is modeled in a fill-in-the-blank format: the LM is given a textual sentence to complete, receives one item at a time as input, and generates up to three new tokens. The answer is considered correct if the correct completion appears in the generated sequence.</p>
        <p>In contrast, the items for the Sentence and Lexical Comprehension tasks required substantial adaptation, because these tasks involve pictures in their original version. The sets of possible answers are indeed presented on illustrated boards with four pictures, among which the child must choose the target picture that depicts the verbal stimulus. Adapting these items involved converting the pictures into linguistic expressions, either single words or complex sentences, which consist of the linguistic description of the distractor and target drawings. In the Sentence Comprehension task, the pictures were converted into sentences keeping the lexical items constant whenever possible and only altering the syntactic structure. This way, the target differs from the stimulus syntactically, but not lexically. For instance, given the linguistic stimulus la pecora è spinta dal ragazzo 'the sheep is pushed by the boy', the possible answers are: cioè il ragazzo indica la pecora; cioè la pecora spinge il ragazzo; cioè il ragazzo spinge la pecora (TARGET); cioè il ragazzo guarda la pecora 'that is, the boy points at the sheep; that is, the sheep pushes the boy; that is, the boy pushes the sheep (TARGET); that is, the boy looks at the sheep'. Since the relevant structure is the reversible passive, target and distractors are active clauses with the same lexical items as the linguistic stimulus. For the Lexical Comprehension task, the converted target and distractors can be full sentences (especially if the stimulus is a verb), words, or phrases. Since the target converted from the target picture cannot be identical to the stimulus word, we used a linguistic expression that is semantically related to the stimulus (e.g., a synonym, hypernym, hyponym, etc.). For instance, given the stimulus un trattore 'a tractor', the set of possible answers is cioè un microscopio; cioè una ruspa (TARGET); cioè un binocolo; cioè una bicicletta 'that is, a microscope; that is, a bulldozer (TARGET); that is, binoculars; that is, a bicycle'. The target is una ruspa 'a bulldozer', which is semantically related to the stimulus.</p>
        <p>The adapted version of the Lexical Comprehension tasks (BVL and Peabody) functions as follows: each item comprises a textual lexical stimulus (a word) followed by a textual adaptation of the possible corresponding pictures, referred to hereafter as textual options (cf. Appendix A). The lexical stimulus is concatenated with each possible textual option to form four complex sentences. Noteworthily, we chose to concatenate the stimulus to each textual option by means of cioè 'that is', a conjunction used to clarify or restate something previously mentioned, which is particularly suited to making explicit the relationship between the stimulus and the textual options. The model's choice is determined based on the perplexity obtained for each sentence. The same applies to the Sentence Comprehension tasks, which comprise items from the Sentence and Idiom Comprehension tasks (BVL, TROG-2, and TCGB-2). Some examples of adapted items (one per task) and the structure of the entire dataset are given in Appendix A.</p>
        <p>¹ 10 out of 175 items from Peabody were excluded, because either the words were too rare to be known by BabyLMs, e.g., emaciato 'emaciated', or it was impossible to adapt the item without using visual stimuli, e.g., for quadrato 'square'.</p>
      </sec>
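      <p>The perplexity-based choice described above can be sketched with the following model-independent snippet. This is a minimal sketch: the helper names (perplexity, choose_option) are ours, and in the actual setup the per-token log-probabilities would be obtained from the LM under evaluation; only the selection logic is shown here.</p>

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sentence given its per-token log-probabilities
    (natural log): exp of the negative mean log-probability."""
    assert token_logprobs, "need at least one token"
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def choose_option(option_logprobs):
    """Return the index of the candidate sentence with the lowest
    perplexity, i.e., the option selected by the model."""
    ppls = [perplexity(lp) for lp in option_logprobs]
    return min(range(len(ppls)), key=ppls.__getitem__)

if __name__ == "__main__":
    # Toy log-probabilities for the four textual options of one item
    # (in practice: scores the LM assigns to "stimulus, cioè option").
    options = [
        [-2.3, -1.9, -2.5],   # cioè un microscopio
        [-0.9, -1.1, -0.8],   # cioè una ruspa (TARGET)
        [-2.0, -2.4, -2.2],   # cioè un binocolo
        [-1.8, -2.1, -1.7],   # cioè una bicicletta
    ]
    print(choose_option(options))  # prints 1 (the TARGET option)
```

      <p>The same selection rule covers the Acceptability Judgment task, where the candidate set is simply the two members of a minimal pair.</p>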
    </sec>
    <sec id="sec-2">
      <title>4. Testing BaBIEs with Minerva</title>
<sec id="sec-2-1a">
        <title>4.1. Model</title>
        <p>To verify the effectiveness of this test, it was administered to a LM. Since no Italian LM primarily trained on child-directed speech and through curriculum learning was available, we opted for a conventional Italian LM.² Specifically, we chose Minerva-3b-base-v1.0 (hereafter referred to as Minerva) [24], a decoder-only model (based on Mistral [25]) with 3 billion parameters. The choice was determined by the fact that, unlike other available models, Minerva was developed as an Italian model, despite also being pre-trained on a substantial amount of English text (660 billion tokens, 50% Italian and 50% English). For the experiments, the Huggingface implementation of the model was used. For the Sentence Completion task, we chose beam search as a generation strategy, with 3 beams. The model sampled the next generated token among the 50 most probable words. We combined this strategy with nucleus sampling, by setting a probability threshold of 0.95.</p>
        <p>² A new BabyLM [23] was released a few weeks before the submission deadline. However, this model is not originally Italian, but instead focuses on second language acquisition and its impact on the performance of a BabyLM.</p>
      </sec>
      <sec id="sec-2-1b">
        <title>4.2. Results</title>
        <p>The performance of Minerva is measured in terms of accuracy (number of true predictions relative to the total number of items). This measure is also used for evaluating children, allowing us to utilize standard scores to evaluate the model. The accuracy achieved by Minerva across all tasks is illustrated in Figure 1. Complete results, including accuracy for each clause type (Sentence Comprehension task - BVL, TROG-2, TCGB-2) and part-of-speech (Lexical Comprehension task - Peabody), are provided in Appendix B. Minerva obtains the highest accuracy in the Acceptability Judgment task (BVL) by far, with 17/18 true predictions and an accuracy of 0.94. Considering the standard scores, this falls between -1SD and +1SD for the age range 6,0-11,11 years (11,11 being the last age considered in the standardization of BVL).³ The accuracy is lower for the Sentence Completion task (BVL), which - it is worth repeating - is the only production task, i.e., 0.43, with 6/14 true predictions. This score is positioned between -1SD and +1SD for the age range 4,0-5,5 years. In the Idiom Comprehension task (BVL), the true predictions given by Minerva are 5/10, and the accuracy is 0.5. This score is only seemingly low. Indeed, it falls between -1SD and +1SD for the age range 6,6-8,11 years and beyond +2SD for the age range 4,0-4,5 years. Let us now turn to the Sentence and Lexical Comprehension tasks (which involve picture-to-language conversion). We used three Sentence Comprehension tasks (from BVL, TCGB-2, TROG-2), which tap into partially different clause types (cf. Appendix B). In the BVL task, 20/40 true predictions are given by the model, corresponding to an accuracy of 0.5. The score is between -1SD and 0 for the age range 4,0-4,11 years. In the TCGB-2 task, the true predictions are 33/74, and the accuracy is 0.44.</p>
      </sec>
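      <p>The Sentence Completion setup described above can be sketched as follows. The keyword names mirror the Huggingface generate API (num_beams, top_k, top_p, max_new_tokens), while the scoring helper is_correct_completion and the Italian item are our own illustrative assumptions, not part of the released resource.</p>

```python
# Generation settings for the Sentence Completion task, as described
# above: beam search (3 beams) combined with top-k sampling (k = 50)
# and nucleus sampling (p = 0.95), generating up to three new tokens.
# These keys would be passed as model.generate(**generation_kwargs)
# in the Huggingface implementation.
generation_kwargs = {
    "num_beams": 3,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "max_new_tokens": 3,
}

def is_correct_completion(generated_text, gold_completion):
    """An answer counts as correct if the gold completion appears
    anywhere in the generated sequence (case-insensitive)."""
    return gold_completion.lower() in generated_text.lower()

if __name__ == "__main__":
    # Hypothetical item: a blank to be filled with "piccolo".
    print(is_correct_completion("piccolo, cioè", "piccolo"))  # prints True
```
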
<sec id="sec-2-1">
        <p>³ In standardized tests, the most frequent score obtained by children of a given age range is represented by 0. The typical score range extends from -2SD to +2SD around 0. For scores below -2SD, the performance is considered deficient. In this study, we consider the score range -1SD to +1SD, as we are not interested in potential language impairments.</p>
        <p>According to the standard scores of TCGB-2, the model is placed between the 32nd and 45th percentiles for the age range 3,6-3,11 years. These percentiles correspond to the judgment of within normal range (as opposed to excellent, good, etc.). In the task adapted from TROG-2, Minerva reaches an accuracy of 0.42 (with 34/80 true predictions). In this test, the number of passed/failed blocks is relevant to the purposes of standard scores (a block being passed if the child provides the target response for at least 3/4 items). The model passes 6/20 blocks, obtaining an age-equivalent score of 4,1 years. The standard score for this age is 115, which falls into the 84th percentile. Finally, we used two Lexical Comprehension item sets (from BVL and Peabody). In the former (BVL), Minerva provides 5/18 true predictions, corresponding to an accuracy of 0.37. This score is below -2SD for the age range 4,0-4,5 years (4,0 years is the minimum age considered for the standardization). In the latter (Peabody), 62/165 predictions are true, the accuracy being 0.37. As mentioned above, we excluded 10 items from the adaptation process. Since the test age-equivalent scores are computed based on 175 items, we consider the raw-score range of 62-72 to establish the age-equivalent score of Minerva, so as to also take into account the excluded items. This raw-score range corresponds to the age-equivalent score range of 102-109 for the age range 3,9-4,2 years (i.e., between 0 and +1SD) and 92-99 for the age range 4,3-4,8 (i.e., between -1SD and 0).</p>
      </sec>
      <sec id="sec-2-2">
        <title>5. Discussion</title>
        <p>The best score is obtained in the Acceptability Judgment task. This is not surprising, and primarily due to the task being formulated with minimal pairs, a method proven to be particularly effective in testing LMs [<xref ref-type="bibr" rid="ref10">10</xref>]. In the other tasks, the results are worse. Nonetheless, the age-equivalent score is not the whole story. In the Sentence Completion task, for instance, in spite of the low score obtained, the completions are not ungrammatical or nonsensical (cf. Table 2; more examples are provided in Appendix C). In the Lexical Comprehension tasks, the score further decreases. The results in both tasks (from BVL and Peabody) are fairly consistent, with an age score struggling to reach 4,5 years. The difficulties encountered by the model can be attributed to the limited context and the nature of the task, which is primarily semantic. The model also performs well in the Idiom Comprehension task, probably because idiomatic expressions are high-frequency expressions that a model trained on a large amount of text might easily have encountered.</p>
        <p>This could also explain why the score is lower for the Sentence Comprehension tasks, although the two are structurally similar. Indeed, unlike idiomatic expressions, the items of these tasks are less predictable and require a certain degree of inference for resolution, making their complexity more similar to that of the Lexical Comprehension tasks.</p>
        <p>The scores obtained by Minerva generally align with the linguistic-age range 4,0-5,0. Variability in scores is observed (i) across different tasks, indicating that certain tasks may be easier for the model than others; and (ii) within the same type of task, depending on the specific test the items were adapted from (e.g., BVL Sentence Comprehension vs. TROG-2). This discrepancy may be due to the adaptation of the test items, which, in turn, depends on the original distractor and target pictures. For instance, items in the Lexical Comprehension task of BVL required the model to make inferences to generate accurate predictions. Another possible factor (e.g., in the Sentence Comprehension task) is the complexity of specific syntactic structures evaluated by some tests. For instance, locative structures are particularly challenging for the model, as are passive clauses (cf. Appendix B). The model often fails to consistently grasp the rationale linking the stimulus and the target answer, likely due to Minerva not being an instruction-tuned model. Negation (Sentence Comprehension task) is an illustrative example in this respect. BaBIEs contains 28 negative clauses (8/28 are passive clauses and 20/28 are active clauses; among the active clauses, 6 contain a double negation, i.e., né...né 'neither...nor'). Minerva selects the correct answer for 9/28 negative clauses (32.14%); of these, two are passives and six are active clauses, of which one contains a double negation. Wrong answers are selected for 19/28 negative clauses (67.86%), of which 6 are passives and 13 are active clauses, of which 5 contain a double negation. Four examples of wrong answers selected by Minerva are reported in Table 1. Such errors suggest that the model does not interpret negation or, in the case of clauses containing a double negation, at least one of the two, consistent with previous findings in the literature ([26], [27]). The complete sets of possible answers for the examples reported in Table 1 are given in Appendix C.</p>
        <p>As can be seen in Table 1, the wrong answers selected by Minerva result from the failure to interpret the negation. In one case (i.e., the third example), the selected answer reveals that the model only interpreted the second (but not the first) negation.</p>
      </sec>
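      <p>Per-category breakdowns like the negation counts above can be reproduced with a simple tally. The record format and the helper name accuracy_by_category are illustrative assumptions, not part of the released resource.</p>

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Group (category, correct) records and return per-category
    (n_correct, n_total, accuracy) tuples, as used to break results
    down by clause type (e.g., negative, passive, locative)."""
    totals = defaultdict(lambda: [0, 0])
    for category, correct in records:
        totals[category][1] += 1
        if correct:
            totals[category][0] += 1
    return {c: (k, n, k / n) for c, (k, n) in totals.items()}

if __name__ == "__main__":
    # Toy records mirroring the counts reported above:
    # 9 correct and 19 wrong negative clauses (9/28 = 32.14%).
    records = [("negative", True)] * 9 + [("negative", False)] * 19
    k, n, acc = accuracy_by_category(records)["negative"]
    print(k, n, round(acc * 100, 2))  # prints: 9 28 32.14
```
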
    </sec>
    <sec id="sec-3">
      <title>6. Conclusions and future work</title>
<sec id="sec-3-1">
        <p>This paper presents BaBIEs, a novel resource specifically designed to evaluate the linguistic competence of BabyLMs and to compare it to that of children. After detailing the sources and the creation process of the resource, we described the procedure for testing the Minerva model with it. Finally, we presented and discussed the model's performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
<sec id="sec-4-1">
        <p>We acknowledge financial support under the PRIN 2022 Project "Computational and linguistic benchmarks for the study of verb argument structure" - CUP I53D23004050006 - Grant Assignment Decree No. 1016, adopted on 07/07/2023 by the Italian Ministry of University and Research (MUR). This research was also partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 "FAIR - Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI," funded by the European Commission under the NextGeneration EU programme.</p>
        <p>Based on the presented findings, the resource appears to be a valuable tool for evaluating not only BabyLMs but LMs in general. The poor performance exhibited by Minerva underscores the gap between child language acquisition and current language model training. This highlights the necessity of modifying model training to better encode human language and, more generally, human linguistic competence.</p>
        <p>Future work will involve a more systematic linguistic analysis of the model's performance, together with a comprehensive error analysis and a comparison to adult Italian speakers. Furthermore, it will involve the development of a multimodal version of the test, which will more closely reflect the original tests and allow the evaluation of multimodal BabyLMs. Additionally, a BabyLM trained exclusively on Italian child-directed speech will be developed and evaluated with both the standard and multimodal versions of the test.</p>
        <p>References: … of the Association for Computational Linguistics: Human Language Technologies, Online, 2021, pp. 1301-1312. [27] T. H. Truong, T. Baldwin, K. Verspoor, T. Cohn, Language models are not naysayers: an analysis of language models on negation benchmarks, in: A. Palmer, J. Camacho-Collados (Eds.), Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), Toronto, Canada, 2023, pp. 101-114.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix A: Examples of adapted items</title>
    </sec>
    <sec id="sec-6">
<title>B. Appendix B: Complete Results</title>
    </sec>
    <sec id="sec-7">
      <title>C. Appendix C: Examples of Target and Wrong Answers Provided by Minerva</title>
<p>Stimuli: La ragazza non sta né indicando né correndo 'The girl is neither pointing nor running'; La scatola non è né grande né gialla 'The box is neither big nor yellow'.</p>
      <p>1. La bambina sta correndo 'The girl is running'; 2. Le bambine stanno correndo 'The girls are running'; 3. La bambina raggiunge la mamma 'The girl reaches her mom'; 4. La bambina è ferma 'The girl is still'.</p>
      <p>1. Il cestino è vuoto 'The bin is empty'; 2. Il cestino è pieno 'The bin is full'; 3. La mamma svuota il cestino 'The mom empties the bin'; 4. Il bambino ha svuotato il cestino 'The boy has emptied the bin'.</p>
      <p>1. La ragazza corre ma non indica 'The girl is running but not pointing'; 2. La ragazza è ferma 'The girl is still'; 3. La ragazza corre e indica 'The girl is running and pointing'; 4. La ragazza indica ma non corre 'The girl is pointing but not running'.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
<article-title>Scaling laws for neural language models</article-title>
          , arXiv preprint arXiv:2001.08361 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Villalobos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sevilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Besiroglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hobbhahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>Will we run out of data? an analysis of the limits of scaling datasets in machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:2211.04325</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What artificial neural networks can tell us about human language acquisition</article-title>
          , in: S. Lappin, J.-P. Bernardy (Eds.),
          <article-title>Algebraic ory in 11 languages, Transactions of the Association structures in natural language</article-title>
          , CRC Press,
          <source>Boca for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>1451</fpage>
          -
          <lpage>1470</lpage>
          . Raton,
          <year>2022</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>60</lpage>
          . [14]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Bergen</surname>
          </string-name>
          , Word acquisition in
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>neural language models, Transactions of the AsWhen Do You Need Billions of Words of Pretraining sociation for Computational Linguistics 10 (</article-title>
          <year>2022</year>
          )
          <article-title>Data?</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting 1-16. of the Association for Computational Linguistics</source>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bioglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          , and the 11th International Joint Conference on Nat- UINAUIL:
          <article-title>A unified benchmark for Italian natural ural Language Processing (Volume 1: Long Papers), language understanding</article-title>
          ,
          <source>in: Proceedings of the 2021</source>
          , pp.
          <fpage>1112</fpage>
          -
          <lpage>1125</lpage>
          .
          <article-title>61st Annual Meeting of the Association for Com-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Risley</surname>
          </string-name>
          ,
          <article-title>Meaningful diferences in the putational Linguistics (Volume 3: System Demoneveryday experience of young American children</article-title>
          ,
          <source>strations)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>348</fpage>
          -
          <lpage>356</lpage>
          . Brookes, Baltimore,
          <year>1995</year>
          . [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Nissim, It5: Text-to-text pretraining for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pruksachatkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Singh, italian language understanding and generation</article-title>
          , in: J.
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
            , Superglue:
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          ,
          <article-title>A stickier benchmark for general-purpose language N</article-title>
          . Xue (Eds.),
          <source>Proceedings of the 2024 Joint Inunderstanding systems, Advances in neural infor- ternational Conference on Computational Linguismation processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
          <article-title>tics, Language Resources and Evaluation (LREC-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <surname>COLING</surname>
          </string-name>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>9422</fpage>
          -
          <lpage>9433</lpage>
          . S. Borgeaud,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , G. Puccetti, The Invalsi Benchmark: meaD.
          <string-name>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <source>Emergent abilities of large lan- suring Language Models Mathematical and Language models, arXiv preprint arXiv:2206</source>
          .
          <article-title>07682 guage understanding in Italian, arXiv preprint (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2403</volume>
          .18697 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Huebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cynthia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          , Baby- [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marini</surname>
          </string-name>
          ,
          <article-title>Batteria per la Valutazione del LinguagBERTa: Learning more grammar with small-scale gio in bambini dai 4 ai 12 anni, Giunti Psychometchild-directed language</article-title>
          ,
          <source>in: Proceedings of the rics, Firenze</source>
          ,
          <year>2015</year>
          . 25th conference on computational natural language [19]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          , Peabody Picture Vocablearning,
          <year>2021</year>
          , pp.
          <fpage>624</fpage>
          -
          <lpage>646</lpage>
          . ulary Test - Revised, American Guidance Service,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          , E. Wilcox, Minneapolis,
          <year>1981</year>
          . C.
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ciro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Mosquera</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Paranjabe</surname>
            , [20]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Stella</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Pizzioli</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          <string-name>
            <surname>Tressoldi</surname>
          </string-name>
          , Peabody - Test
          <string-name>
            <surname>A. Williams</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Linzen</surname>
          </string-name>
          , et al.,
          <article-title>Findings of the di vocabolario recettivo</article-title>
          , Omega, Torino,
          <year>2000</year>
          . BabyLM Challenge: Sample-eficient pretraining on [21]
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <article-title>Test for Reception of Grammar - Verdevelopmentally plausible corpora</article-title>
          ,
          <source>in: Proceedings sion 2</source>
          ,
          <string-name>
            <surname>Giunti</surname>
            <given-names>Psychometrics</given-names>
          </string-name>
          , Firenze,
          <year>2009</year>
          .
          <article-title>of the BabyLM Challenge at</article-title>
          the 27th Conference on [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chilosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piazzalunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pfanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cipriani</surname>
          </string-name>
          ,
          <source>Computational Natural Language Learning</source>
          ,
          <year>2023</year>
          , Test di Comprensione Grammaticale per Bambinipp.
          <volume>1</volume>
          -
          <fpage>34</fpage>
          . Seconda Edizione, Hogrefe, Firenze,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohananey</surname>
          </string-name>
          , [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , R.-C. Chen, BAMBINOW. Peng,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>BLiMP: The</surname>
            <given-names>LM</given-names>
          </string-name>
          :
          <article-title>(Bilingual-) Human-Inspired Continual benchmark of linguistic minimal pairs for English, Pretraining of BabyLM, arXiv preprint Transactions of the Association for Computational arXiv</article-title>
          :
          <volume>2406</volume>
          .11418 (
          <year>2024</year>
          ).
          <article-title>Linguistics 8 (</article-title>
          <year>2020</year>
          )
          <fpage>377</fpage>
          -
          <lpage>392</lpage>
          . [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , S. Co-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , nia, E. Barba,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <surname>Minerva-</surname>
          </string-name>
          3b
          <string-name>
            <surname>-baseS. R. Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>GLUE:</surname>
          </string-name>
          <article-title>A multi-task benchmark and v1.0, huggingface</article-title>
          .co/sapienzanlp/Minerva-3B
          <article-title>-baseanalysis platform for natural language understand- v1.0 (2024). ing</article-title>
          ,
          <source>in: Proceedings of the 2018 EMNLP Workshop</source>
          [25]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , C. Bamford, BlackboxNLP: Analyzing and
          <string-name>
            <surname>Interpreting Neural D. S. Chaplot</surname>
            , D. d. l. Casas,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <surname>Networks for</surname>
            <given-names>NLP</given-names>
          </string-name>
          ,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . G. Lample,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Mistral</surname>
            <given-names>7b</given-names>
          </string-name>
          , arXiv
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Evanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lakretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>King</surname>
          </string-name>
          , Language ac- preprint
          <source>arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
          <article-title>quisition: do children and language models follow [26]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , R. D. Hjelm,
          <article-title>similar learning stages?</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd- A. Sordoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Courville</surname>
          </string-name>
          , Understanding by underGraber,
          <source>N. Okazaki (Eds.)</source>
          ,
          <article-title>Findings of the Associa- standing not: Modeling negation in language modtion for Computational Linguistics: ACL</article-title>
          <year>2023</year>
          ,
          <year>2023</year>
          , els, in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
          </string-name>
          , L. Zettlemoyer, pp.
          <fpage>12205</fpage>
          -
          <lpage>12218</lpage>
          . D.
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
          </string-name>
          , R. Cotterell,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pimentel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.), Proceedings of the R. P.
          <article-title>Levy, Testing the predictions of surprisal the- 2021 Conference of the North American Chapter 1. La bambina sta correndo 'The girl is running' (WRONG) 4. Il bambino ha svuotato il cestino 'The boy has emptied the bin' (WRONG) 4. La ragazza indica ma non corre 'The girl is pointing but not running' (WRONG) 2. La scatola è grande e gialla 'The box is big and yellow' (WRONG)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>