<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BaBIEs: A Benchmark for the Linguistic Evaluation of Italian Baby Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Capone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Suozzi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca E. Lebani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa</institution>
          ,
          <addr-line>Via Santa Maria 36, 56126 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>European Centre for Living Technology (ECLT)</institution>
          ,
          <addr-line>Ca' Bottacin, Dorsoduro 3911, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>QuaCLing Lab, Dipartimento di Studi Linguistici e Culturali Comparati, Università Ca' Foscari Venezia</institution>
          ,
          <addr-line>Dorsoduro 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The possibility of comparing the linguistic competence of Language Models (LMs) to that of children has gained growing attention lately, raising the need for effective tools for evaluating both the former and the latter. To this purpose, we developed a resource for the linguistic evaluation of BabyLMs, which are LMs trained on datasets comparable to the linguistic stimulus received by children. This resource adapts four standardized tests for the evaluation of the linguistic skills of Italian-speaking children (BVL, TROG-2, TCGB-2 and Peabody). To verify the effectiveness of our benchmark, we administered it to Minerva, an LLM pretrained from scratch on Italian. Our results indicate that Minerva struggles to master certain linguistic aspects, achieving an age-equivalent score of 4 years, and that the type of task administered affects the model's performance.</p>
      </abstract>
      <kwd-group>
<kwd>Language Models</kwd>
        <kwd>Linguistic Evaluation</kwd>
        <kwd>Benchmark</kwd>
        <kwd>BabyLMs</kwd>
        <kwd>Language Acquisition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>… the light of the experiments in Section 5. Finally, in Section 6, some conclusions and possible future research directions are outlined.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author. † For the specific purposes of the Italian Academy, Luca Capone is responsible for Sections 2 and 4.1, Alice Suozzi and Luca Capone for Section 3, Alice Suozzi for Sections 4.2 and 5, Alessandro Lenci and Gianluca E. Lebani for Sections 1 and 6. Emails: luca.capone@fileli.unipi.it (L. Capone); alice.suozzi@unive.it (A. Suozzi); gianluca.lebani@unive.it (G. E. Lebani); alessandro.lenci@unipi.it (A. Lenci)</p>
<p>ORCID: 0000-0002-1872-6956 (L. Capone); 0000-0002-5215-7742 (A. Suozzi); 0000-0002-3588-1077 (G. E. Lebani); 0000-0001-5790-4308 (A. Lenci). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>… 10 million words per year on average, reaching around 100 million words by age 10. Zhang et al. [<xref ref-type="bibr" rid="ref4">4</xref>] demonstrate that substantial amounts of data are required to achieve good results in NLU tasks, such as those evaluated by SuperGLUE [<xref ref-type="bibr" rid="ref6">6</xref>]. Performance improvements become noticeable after surpassing the threshold of 1 billion words and continue to improve steadily even beyond 30 billion words. However, tasks focusing on language syntax (e.g., acceptability judgments and minimal pairs) exhibit the most significant improvements between 1 million and 100 million words, after which the learning curve plateaus. The authors conclude that, while acquiring factual knowledge necessitates large volumes of text, syntactic and semantic competence reaches saturation within the range of 10 million to 100 million words. Similar conclusions are reported by Wei et al. [<xref ref-type="bibr" rid="ref7">7</xref>], who investigate the emergent skills of various LLMs, confirming that the most sophisticated behaviors primarily arise from scaling up model training. These findings justify the focus on BabyLMs, which are LMs trained on limited amounts of data, qualitatively resembling the stimuli received by a preschooler. Huebner et al. [<xref ref-type="bibr" rid="ref8">8</xref>] illustrate this approach by training BabyBERTa on 50 million words of child-directed speech and simplified written text, achieving results comparable to RoBERTa-base on a grammar test suite. The BabyLM challenges [<xref ref-type="bibr" rid="ref9">9</xref>] fall within this line of research, aiming to optimize model training through curriculum learning (CL) techniques and architectural optimizations. This approach not only makes research more affordable, but also results in models that are more cognitively plausible with respect to human language acquisition. Although the proposed CL techniques did not lead to consistent improvements across all evaluation tasks [<xref ref-type="bibr" rid="ref9">9</xref>], it has been demonstrated that a model trained with limited data (10 million words) can achieve results comparable to those of large LMs on various benchmarks.</p>
      <sec id="sec-1-1">
        <title>2.2. Baby benchmarks for Baby models</title>
        <p>These results prompt a reconsideration of the comparability between LM training and human language learning. While benchmarks like BLiMP [<xref ref-type="bibr" rid="ref10">10</xref>] and GLUE [<xref ref-type="bibr" rid="ref11">11</xref>] facilitate comparisons between different models, they are not suitable for comparing BabyLMs to children who are acquiring a first language. Several studies attempt to address this shortcoming. For instance, Evanson et al. [<xref ref-type="bibr" rid="ref12">12</xref>] compare the learning order of certain syntactic structures in English between GPT-2 and preschoolers. They find that the model exhibits a consistent order in learning syntactic structures, which aligns with the one observed in preschoolers. Other tests that compare training in LMs to human language acquisition include the reading time test [<xref ref-type="bibr" rid="ref13">13</xref>] and the age-of-acquisition test [14].</p>
        <p>For the Italian language, the three main benchmarks are: (i) UINAUIL [15], which includes six NLU tasks selected from the EVALITA (Evaluation campaign for Language Technology in Italian) archive; (ii) IT5 [16], which focuses on summarization tasks; (iii) the Invalsi benchmark [17], which evaluates the mathematical and linguistic competences of LMs in Italian. Only the latter is relevant to our study, as it allows a comparison between human language learning (in the school-age range 6-18 years) and that of the models. However, the age range considered by Invalsi involves more sophisticated NLU tasks, rather than the fundamental linguistic abilities learned during the preschool period, within the 100 million word budget.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Nurturing BaBIEs</title>
        <p>In order to evaluate the linguistic abilities of BabyLMs, we developed BaBIEs by adapting four standardized tests designed to assess the linguistic competence of Italian-speaking children. These tests, which tap into different aspects of linguistic competence, are:</p>
        <p>• Batteria per la Valutazione del Linguaggio in bambini dai 4 ai 12 anni (BVL) 'Battery for the Assessment of Language in Children aged 4 to 12' [18]. BVL is designed to provide a global linguistic profile of Italian-speaking children and was standardized on a sample of 1,086 children aged 4 to 12. It consists of 18 tasks (e.g., semantic and phonological fluency, sentence and word comprehension, emotional prosody comprehension, etc.) grouped into three sections, i.e., production, comprehension, and repetition.</p>
        <p>• Peabody - Test di vocabolario recettivo (Italian adaptation of the Peabody Picture Vocabulary Test - Revised) [19, 20]. PPVT-R is intended to measure the receptive vocabulary of the subject and was standardized on a sample of 2,400 subjects aged 3 to 12 and 16. It consists of 175 items.</p>
        <p>• Test for Reception of Grammar - Version 2 (TROG-2) [21]. TROG-2 is designed to assess the comprehension of verbal language, especially syntactic structures, and was standardized on a sample of 1,276 subjects aged 4 to 87. It consists of 20 blocks, each containing four items that focus on a grammatical structure (e.g., zero anaphor, reversible in and on, relative clause in object, etc.).</p>
        <p>• Test di Comprensione Grammaticale per Bambini - Seconda Edizione (TCGB-2) 'Test of Grammatical Comprehension for Children - Second Edition' [22]. Analogously to TROG-2, TCGB-2 is a tool for assessing the comprehension of grammatical structures and was standardized on a sample of 455 children aged 4 to 11. It contains 74 items which measure the comprehension of six structures, i.e., the phenomenon of inflection and five types of sentences: locative, active, passive, relative and dative.</p>
        <p>It is worth noting that all tests are standardized on samples of typically-developing Italian-speaking subjects and are designed to be orally administered. That is, the stimuli are always read by the experimenter, and the child is asked either to answer orally or to point at a picture.</p>
        <p>BaBIEs consists of five tasks (see Table 4 in Appendix A): (i) Sentence Completion (the only task assessing linguistic production), (ii) Acceptability Judgment, (iii) Idiom Comprehension, (iv) Sentence Comprehension, (v) Lexical Comprehension. These tasks are taken from BVL. We added 165 out of 175 items from Peabody (Lexical Comprehension task) and all the items contained in TROG-2 and TCGB-2 (both Sentence Comprehension tasks).¹ Except for the Sentence Completion task and the Acceptability Judgment task, all of the others are similarly-structured comprehension tasks. The child is presented with an oral linguistic stimulus (i.e., a word, a sentence or an idiom) and with a set of three or four possible answers, from which the child must choose the answer corresponding to the linguistic stimulus (the target). Together, a stimulus and its set of possible answers constitute a test item. The key factor in the process of item adaptation from the original tests to BaBIEs was the modality in which the sets of possible answers are displayed.</p>
        <p>For the Acceptability Judgment task, we constructed minimal pairs of sentences by creating a grammatical or ungrammatical version of the verbal stimulus (depending on the (un)grammaticality of the original stimulus). In this task, the model receives one pair at a time. Its choice is determined by perplexity, with the sentence having the lowest perplexity score being chosen by the model.</p>
        <p>For the Sentence Completion and Idiom Comprehension tasks, as both the stimuli and the sets of possible answers are linguistic expressions, the adaptation process only involved reformatting them to be readable by the model. The Sentence Completion task is modeled in a fill-in-the-blank format: the LM is given a textual sentence to complete, receives one item at a time as input, and generates up to three new tokens. The answer is considered correct if the correct completion appears in the generated sequence.</p>
        <p>In contrast, the items for the Sentence and Lexical Comprehension tasks required substantial adaptation, because these tasks involve pictures in their original version. The sets of possible answers are indeed presented on illustrated boards with four pictures, among which the child must choose the target picture that depicts the verbal stimulus. Adapting these items involved converting the pictures into linguistic expressions, either single words or complex sentences, which consist of the linguistic description of the distractor and target drawings. In the Sentence Comprehension task, the pictures were converted into sentences keeping the lexical items constant whenever possible and only altering the syntactic structure. This way, the target differs from the stimulus syntactically, but not lexically. For instance, given the linguistic stimulus la pecora è spinta dal ragazzo 'the sheep is pushed by the boy', the possible answers are: cioè il ragazzo indica la pecora; cioè la pecora spinge il ragazzo; cioè il ragazzo spinge la pecora (TARGET); cioè il ragazzo guarda la pecora 'that is, the boy points at the sheep; that is, the sheep pushes the boy; that is, the boy pushes the sheep (TARGET); that is, the boy looks at the sheep'. Since the relevant structure is the reversible passive, target and distractors are active clauses with the same lexical items as the linguistic stimulus. For the Lexical Comprehension task, the converted target and distractors can be full sentences (especially if the stimulus is a verb), words, or phrases. Since the target converted from the target picture cannot be identical to the stimulus word, we used a linguistic expression that is semantically related to the stimulus (e.g., a synonym, hypernym, hyponym, etc.). For instance, given the stimulus un trattore 'a tractor', the set of possible answers is cioè un microscopio; cioè una ruspa (TARGET); cioè un binocolo; cioè una bicicletta 'that is, a microscope; that is, a bulldozer (TARGET); that is, binoculars; that is, a bicycle'. The target is una ruspa 'a bulldozer', which is semantically related to the stimulus.</p>
        <p>The adapted version of the Lexical Comprehension tasks (BVL and Peabody) functions as follows: each item comprises a textual lexical stimulus (a word) followed by a textual adaptation of the possible corresponding pictures, referred to hereafter as textual options (cf. Appendix A). The lexical stimulus is concatenated with each possible textual option to form four complex sentences. Noteworthily, we chose to concatenate the stimulus to each textual option by means of cioè 'that is', a conjunction used to clarify or restate something previously mentioned, which is particularly suited to making explicit the relationship between the stimulus and the textual options. The model's choice is determined based on the perplexity obtained for each sentence. The same applies to the Sentence Comprehension tasks, which comprise items from the Sentence and Idiom Comprehension tasks (BVL, TROG-2, and TCGB-2). Some examples of adapted items (one per task) and the structure of the entire dataset are given in Appendix A.</p>
        <p>¹ 10 out of 175 items from Peabody were excluded, because either the words were too rare to be known by BabyLMs, e.g., emaciato 'emaciated', or it was impossible to adapt the item without using visual stimuli, e.g., for quadrato 'square'.</p>
      </sec>
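      <p>The perplexity-based choice described above can be sketched with the following model-independent snippet. This is a minimal sketch: the helper names (perplexity, choose_option) are ours, and in the actual setup the per-token log-probabilities would be obtained from the LM under evaluation; only the selection logic is shown here.</p>

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sentence given its per-token log-probabilities
    (natural log): exp of the negative mean log-probability."""
    assert token_logprobs, "need at least one token"
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def choose_option(option_logprobs):
    """Return the index of the candidate sentence with the lowest
    perplexity, i.e., the option selected by the model."""
    ppls = [perplexity(lp) for lp in option_logprobs]
    return min(range(len(ppls)), key=ppls.__getitem__)

if __name__ == "__main__":
    # Toy log-probabilities for the four textual options of one item
    # (in practice: scores the LM assigns to "stimulus, cioè option").
    options = [
        [-2.3, -1.9, -2.5],   # cioè un microscopio
        [-0.9, -1.1, -0.8],   # cioè una ruspa (TARGET)
        [-2.0, -2.4, -2.2],   # cioè un binocolo
        [-1.8, -2.1, -1.7],   # cioè una bicicletta
    ]
    print(choose_option(options))  # prints 1 (the TARGET option)
```

      <p>The same selection rule covers the Acceptability Judgment task, where the candidate set is simply the two members of a minimal pair.</p>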
    </sec>
    <sec id="sec-2">
      <title>4. Testing BaBIEs with Minerva</title>
<sec id="sec-2-1a">
        <title>4.1. Model</title>
        <p>To verify the effectiveness of this test, it was administered to a LM. Since no Italian LM primarily trained on child-directed speech and through curriculum learning was available, we opted for a conventional Italian LM.² Specifically, we chose Minerva-3b-base-v1.0 (hereafter referred to as Minerva) [24], a decoder-only model (based on Mistral [25]) with 3 billion parameters. The choice was determined by the fact that, unlike other available models, Minerva was developed as an Italian model, despite also being pre-trained on a substantial amount of English text (660 billion tokens, 50% Italian and 50% English). For the experiments, the Huggingface implementation of the model was used. For the Sentence Completion task, we chose beam search as a generation strategy, with 3 beams. The model sampled the next generated token among the 50 most probable words. We combined this strategy with nucleus sampling, by setting a probability threshold of 0.95.</p>
        <p>² A new BabyLM [23] was released a few weeks before the submission deadline. However, this model is not originally Italian, but instead focuses on second language acquisition and its impact on the performance of a BabyLM.</p>
      </sec>
      <sec id="sec-2-1b">
        <title>4.2. Results</title>
        <p>The performance of Minerva is measured in terms of accuracy (number of true predictions relative to the total number of items). This measure is also used for evaluating children, allowing us to utilize standard scores to evaluate the model. The accuracy achieved by Minerva across all tasks is illustrated in Figure 1. Complete results, including accuracy for each clause type (Sentence Comprehension task - BVL, TROG-2, TCGB-2) and part-of-speech (Lexical Comprehension task - Peabody), are provided in Appendix B. Minerva obtains the highest accuracy in the Acceptability Judgment task (BVL) by far, with 17/18 true predictions and an accuracy of 0.94. Considering the standard scores, this falls between -1SD and +1SD for the age range 6,0-11,11 years (11,11 being the last age considered in the standardization of BVL).³ The accuracy is lower for the Sentence Completion task (BVL), which - it is worth repeating - is the only production task, i.e., 0.43, with 6/14 true predictions. This score is positioned between -1SD and +1SD for the age range 4,0-5,5 years. In the Idiom Comprehension task (BVL), the true predictions given by Minerva are 5/10, and the accuracy is 0.5. This score is only seemingly low. Indeed, it falls between -1SD and +1SD for the age range 6,6-8,11 years and beyond +2SD for the age range 4,0-4,5 years. Let us now turn to the Sentence and Lexical Comprehension tasks (which involve picture-to-language conversion). We used three Sentence Comprehension tasks (from BVL, TCGB-2, TROG-2), which tap into partially different clause types (cf. Appendix B). In the BVL task, 20/40 true predictions are given by the model, corresponding to an accuracy of 0.5. The score is between -1SD and 0 for the age range 4,0-4,11 years. In the TCGB-2 task, the true predictions are 33/74, and the accuracy is 0.44.</p>
      </sec>
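      <p>The Sentence Completion setup described above can be sketched as follows. The keyword names mirror the Huggingface generate API (num_beams, top_k, top_p, max_new_tokens), while the scoring helper is_correct_completion and the Italian item are our own illustrative assumptions, not part of the released resource.</p>

```python
# Generation settings for the Sentence Completion task, as described
# above: beam search (3 beams) combined with top-k sampling (k = 50)
# and nucleus sampling (p = 0.95), generating up to three new tokens.
# These keys would be passed as model.generate(**generation_kwargs)
# in the Huggingface implementation.
generation_kwargs = {
    "num_beams": 3,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "max_new_tokens": 3,
}

def is_correct_completion(generated_text, gold_completion):
    """An answer counts as correct if the gold completion appears
    anywhere in the generated sequence (case-insensitive)."""
    return gold_completion.lower() in generated_text.lower()

if __name__ == "__main__":
    # Hypothetical item: a blank to be filled with "piccolo".
    print(is_correct_completion("piccolo, cioè", "piccolo"))  # prints True
```
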
<sec id="sec-2-1">
        <p>³ In standardized tests, the most frequent score obtained by children of a given age range is represented by 0. The typical score range extends from -2SD to +2SD around 0. For scores below -2SD, the performance is considered deficient. In this study, we consider the score range -1SD to +1SD, as we are not interested in potential language impairments.</p>
        <p>According to the standard scores of TCGB-2, the model is placed between the 32nd and 45th percentiles for the age range 3,6-3,11 years. These percentiles correspond to the judgment of within normal range (as opposed to excellent, good, etc.). In the task adapted from TROG-2, Minerva reaches an accuracy of 0.42 (with 34/80 true predictions). In this test, the number of passed/failed blocks is relevant to the purposes of standard scores (a block being passed if the child provides the target response for at least 3/4 items). The model passes 6/20 blocks, obtaining an age-equivalent score of 4,1 years. The standard score for this age is 115, which falls into the 84th percentile. Finally, we used two Lexical Comprehension item sets (from BVL and Peabody). In the former (BVL), Minerva provides 5/18 true predictions, corresponding to an accuracy of 0.37. This score is below -2SD for the age range 4,0-4,5 years (4,0 years is the minimum age considered for the standardization). In the latter (Peabody), 62/165 predictions are true, the accuracy being 0.37. As mentioned above, we excluded 10 items from the adaptation process. Since the test age-equivalent scores are computed based on 175 items, we consider the raw-score range of 62-72 to establish the age-equivalent score of Minerva, so as to also take into account the excluded items. This raw-score range corresponds to the age-equivalent score range of 102-109 for the age range 3,9-4,2 years (i.e., between 0 and +1SD) and 92-99 for the age range 4,3-4,8 (i.e., between -1SD and 0).</p>
      </sec>
      <sec id="sec-2-2">
        <title>5. Discussion</title>
        <p>The best score is obtained in the Acceptability Judgment task. This is not surprising, and primarily due to the task being formulated with minimal pairs, a method proven to be particularly effective in testing LMs [<xref ref-type="bibr" rid="ref10">10</xref>]. In the other tasks, the results are worse. Nonetheless, the age-equivalent score is not the whole story. In the Sentence Completion task, for instance, in spite of the low score obtained, the completions are not ungrammatical or nonsensical (cf. Table 2; more examples are provided in Appendix C). In the Lexical Comprehension tasks, the score further decreases. The results in both tasks (from BVL and Peabody) are fairly consistent, with an age score struggling to reach 4,5 years. The difficulties encountered by the model can be attributed to the limited context and the nature of the task, which is primarily semantic. The model also performs well in the Idiom Comprehension task, probably because idiomatic expressions are high-frequency expressions that a model trained on a large amount of text might easily have encountered.</p>
        <p>This could also explain why the score is lower for the Sentence Comprehension tasks, although the two are structurally similar. Indeed, unlike idiomatic expressions, the items of these tasks are less predictable and require a certain degree of inference for resolution, making their complexity more similar to that of the Lexical Comprehension tasks.</p>
        <p>The scores obtained by Minerva generally align with the linguistic-age range 4,0-5,0. Variability in scores is observed (i) across different tasks, indicating that certain tasks may be easier for the model than others; and (ii) within the same type of task, depending on the specific test the items were adapted from (e.g., BVL Sentence Comprehension vs. TROG-2). This discrepancy may be due to the adaptation of the test items, which, in turn, depends on the original distractor and target pictures. For instance, items in the Lexical Comprehension task of BVL required the model to make inferences to generate accurate predictions. Another possible factor (e.g., in the Sentence Comprehension task) is the complexity of specific syntactic structures evaluated by some tests. For instance, locative structures are particularly challenging for the model, as are passive clauses (cf. Appendix B). The model often fails to consistently grasp the rationale linking the stimulus and the target answer, likely due to Minerva not being an instruction-tuned model. Negation (Sentence Comprehension task) is an illustrative example in this respect. BaBIEs contains 28 negative clauses (8/28 are passive clauses and 20/28 are active clauses; among the active clauses, 6 contain a double negation, i.e., né...né 'neither...nor'). Minerva selects the correct answer for 9/28 negative clauses (32.14%); of these, two are passives and six are active clauses, of which one contains a double negation. Wrong answers are selected for 19/28 negative clauses (67.86%), of which 6 are passives and 13 are active clauses, of which 5 contain a double negation. Four examples of wrong answers selected by Minerva are reported in Table 1. Such errors suggest that the model does not interpret negation or, in the case of clauses containing a double negation, at least one of the two, consistent with previous findings in the literature ([26], [27]). The complete sets of possible answers for the examples reported in Table 1 are given in Appendix C.</p>
        <p>As can be seen in Table 1, the wrong answers selected by Minerva result from the failure to interpret the negation. In one case (i.e., the third example), the selected answer reveals that the model only interpreted the second (but not the first) negation.</p>
      </sec>
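      <p>Per-category breakdowns like the negation counts above can be reproduced with a simple tally. The record format and the helper name accuracy_by_category are illustrative assumptions, not part of the released resource.</p>

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Group (category, correct) records and return per-category
    (n_correct, n_total, accuracy) tuples, as used to break results
    down by clause type (e.g., negative, passive, locative)."""
    totals = defaultdict(lambda: [0, 0])
    for category, correct in records:
        totals[category][1] += 1
        if correct:
            totals[category][0] += 1
    return {c: (k, n, k / n) for c, (k, n) in totals.items()}

if __name__ == "__main__":
    # Toy records mirroring the counts reported above:
    # 9 correct and 19 wrong negative clauses (9/28 = 32.14%).
    records = [("negative", True)] * 9 + [("negative", False)] * 19
    k, n, acc = accuracy_by_category(records)["negative"]
    print(k, n, round(acc * 100, 2))  # prints: 9 28 32.14
```
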
    </sec>
    <sec id="sec-3">
      <title>6. Conclusions and future work</title>
<sec id="sec-3-1">
        <p>This paper presents BaBIEs, a novel resource specifically designed to evaluate the linguistic competence of BabyLMs and to compare it to that of children. After detailing the sources and the creation process of the resource, we described the procedure for testing the Minerva model with it. Finally, we presented and discussed the model's performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
<sec id="sec-4-1">
        <p>We acknowledge financial support under the PRIN 2022 Project "Computational and linguistic benchmarks for the study of verb argument structure" - CUP I53D23004050006 - Grant Assignment Decree No. 1016, adopted on 07/07/2023 by the Italian Ministry of University and Research (MUR). This research was also partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 "FAIR - Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI," funded by the European Commission under the NextGeneration EU programme.</p>
        <p>Based on the presented findings, the resource appears to be a valuable tool for evaluating not only BabyLMs but LMs in general. The poor performance exhibited by Minerva underscores the gap between child language acquisition and current language model training. This highlights the necessity of modifying model training to better encode human language and, more generally, human linguistic competence.</p>
        <p>Future work will involve a more systematic linguistic analysis of the model's performance, together with a comprehensive error analysis and a comparison to adult Italian speakers. Furthermore, it will involve the development of a multimodal version of the test, which will more closely reflect the original tests and allow the evaluation of multimodal BabyLMs. Additionally, a BabyLM trained exclusively on Italian child-directed speech will be developed and evaluated with both the standard and multimodal versions of the test.</p>
        <p>References: … of the Association for Computational Linguistics: Human Language Technologies, Online, 2021, pp. 1301-1312. [27] T. H. Truong, T. Baldwin, K. Verspoor, T. Cohn, Language models are not naysayers: an analysis of language models on negation benchmarks, in: A. Palmer, J. Camacho-Collados (Eds.), Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), Toronto, Canada, 2023, pp. 101-114.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix A: Examples of adapted items</title>
    </sec>
    <sec id="sec-6">
<title>B. Appendix B: Complete Results</title>
    </sec>
    <sec id="sec-7">
      <title>C. Appendix C: Examples of Target and Wrong Answers Provided by Minerva</title>
<p>Stimuli: La ragazza non sta né indicando né correndo 'The girl is neither pointing nor running'; La scatola non è né grande né gialla 'The box is neither big nor yellow'.</p>
      <p>1. La bambina sta correndo 'The girl is running'; 2. Le bambine stanno correndo 'The girls are running'; 3. La bambina raggiunge la mamma 'The girl reaches her mom'; 4. La bambina è ferma 'The girl is still'.</p>
      <p>1. Il cestino è vuoto 'The bin is empty'; 2. Il cestino è pieno 'The bin is full'; 3. La mamma svuota il cestino 'The mom empties the bin'; 4. Il bambino ha svuotato il cestino 'The boy has emptied the bin'.</p>
      <p>1. La ragazza corre ma non indica 'The girl is running but not pointing'; 2. La ragazza è ferma 'The girl is still'; 3. La ragazza corre e indica 'The girl is running and pointing'; 4. La ragazza indica ma non corre 'The girl is pointing but not running'.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
<article-title>Scaling laws for neural language models</article-title>
          , arXiv preprint arXiv:2001.08361 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Villalobos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sevilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Besiroglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hobbhahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>Will we run out of data? an analysis of the limits of scaling datasets in machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:2211.04325</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What artificial neural networks can tell us about human language acquisition</article-title>
          , in: S. Lappin, J.-P. Bernardy (Eds.),
          <article-title>Algebraic ory in 11 languages, Transactions of the Association structures in natural language</article-title>
          , CRC Press,
          <source>Boca for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>1451</fpage>
          -
          <lpage>1470</lpage>
          . Raton,
          <year>2022</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>60</lpage>
          . [14]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Bergen</surname>
          </string-name>
          , Word acquisition in
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>neural language models, Transactions of the AsWhen Do You Need Billions of Words of Pretraining sociation for Computational Linguistics 10 (</article-title>
          <year>2022</year>
          )
          <article-title>Data?</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting 1-16. of the Association for Computational Linguistics</source>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bioglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          , and the 11th International Joint Conference on Nat- UINAUIL:
          <article-title>A unified benchmark for Italian natural ural Language Processing (Volume 1: Long Papers), language understanding</article-title>
          ,
          <source>in: Proceedings of the 2021</source>
          , pp.
          <fpage>1112</fpage>
          -
          <lpage>1125</lpage>
          .
          <article-title>61st Annual Meeting of the Association for Com-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Risley</surname>
          </string-name>
          ,
          <article-title>Meaningful diferences in the putational Linguistics (Volume 3: System Demoneveryday experience of young American children</article-title>
          ,
          <source>strations)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>348</fpage>
          -
          <lpage>356</lpage>
          . Brookes, Baltimore,
          <year>1995</year>
          . [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Nissim, It5: Text-to-text pretraining for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pruksachatkun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Singh, italian language understanding and generation</article-title>
          , in: J.
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
            , Superglue:
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          ,
          <article-title>A stickier benchmark for general-purpose language N</article-title>
          . Xue (Eds.),
          <source>Proceedings of the 2024 Joint Inunderstanding systems, Advances in neural infor- ternational Conference on Computational Linguismation processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
          <article-title>tics, Language Resources and Evaluation (LREC-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <surname>COLING</surname>
          </string-name>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>9422</fpage>
          -
          <lpage>9433</lpage>
          . S. Borgeaud,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , G. Puccetti, The Invalsi Benchmark: meaD.
          <string-name>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <source>Emergent abilities of large lan- suring Language Models Mathematical and Language models, arXiv preprint arXiv:2206</source>
          .
          <article-title>07682 guage understanding in Italian, arXiv preprint (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2403</volume>
          .18697 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Huebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cynthia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          , Baby- [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marini</surname>
          </string-name>
          ,
          <article-title>Batteria per la Valutazione del LinguagBERTa: Learning more grammar with small-scale gio in bambini dai 4 ai 12 anni, Giunti Psychometchild-directed language</article-title>
          ,
          <source>in: Proceedings of the rics, Firenze</source>
          ,
          <year>2015</year>
          . 25th conference on computational natural language [19]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Dunn</surname>
          </string-name>
          , Peabody Picture Vocablearning,
          <year>2021</year>
          , pp.
          <fpage>624</fpage>
          -
          <lpage>646</lpage>
          . ulary Test - Revised, American Guidance Service,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          , E. Wilcox, Minneapolis,
          <year>1981</year>
          . C.
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ciro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Mosquera</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Paranjabe</surname>
            , [20]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Stella</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Pizzioli</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          <string-name>
            <surname>Tressoldi</surname>
          </string-name>
          , Peabody - Test
          <string-name>
            <surname>A. Williams</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Linzen</surname>
          </string-name>
          , et al.,
          <article-title>Findings of the di vocabolario recettivo</article-title>
          , Omega, Torino,
          <year>2000</year>
          . BabyLM Challenge: Sample-eficient pretraining on [21]
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <article-title>Test for Reception of Grammar - Verdevelopmentally plausible corpora</article-title>
          ,
          <source>in: Proceedings sion 2</source>
          ,
          <string-name>
            <surname>Giunti</surname>
            <given-names>Psychometrics</given-names>
          </string-name>
          , Firenze,
          <year>2009</year>
          .
          <article-title>of the BabyLM Challenge at</article-title>
          the 27th Conference on [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chilosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piazzalunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pfanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cipriani</surname>
          </string-name>
          ,
          <source>Computational Natural Language Learning</source>
          ,
          <year>2023</year>
          , Test di Comprensione Grammaticale per Bambinipp.
          <volume>1</volume>
          -
          <fpage>34</fpage>
          . Seconda Edizione, Hogrefe, Firenze,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohananey</surname>
          </string-name>
          , [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , R.-C. Chen, BAMBINOW. Peng,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>BLiMP: The</surname>
            <given-names>LM</given-names>
          </string-name>
          :
          <article-title>(Bilingual-) Human-Inspired Continual benchmark of linguistic minimal pairs for English, Pretraining of BabyLM, arXiv preprint Transactions of the Association for Computational arXiv</article-title>
          :
          <volume>2406</volume>
          .11418 (
          <year>2024</year>
          ).
          <article-title>Linguistics 8 (</article-title>
          <year>2020</year>
          )
          <fpage>377</fpage>
          -
          <lpage>392</lpage>
          . [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , S. Co-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , nia, E. Barba,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <surname>Minerva-</surname>
          </string-name>
          3b
          <string-name>
            <surname>-baseS. R. Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>GLUE:</surname>
          </string-name>
          <article-title>A multi-task benchmark and v1.0, huggingface</article-title>
          .co/sapienzanlp/Minerva-3B
          <article-title>-baseanalysis platform for natural language understand- v1.0 (2024). ing</article-title>
          ,
          <source>in: Proceedings of the 2018 EMNLP Workshop</source>
          [25]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , C. Bamford, BlackboxNLP: Analyzing and
          <string-name>
            <surname>Interpreting Neural D. S. Chaplot</surname>
            , D. d. l. Casas,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <surname>Networks for</surname>
            <given-names>NLP</given-names>
          </string-name>
          ,
          <year>2018</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . G. Lample,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Mistral</surname>
            <given-names>7b</given-names>
          </string-name>
          , arXiv
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Evanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lakretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>King</surname>
          </string-name>
          , Language ac- preprint
          <source>arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
          <article-title>quisition: do children and language models follow [26]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , R. D. Hjelm,
          <article-title>similar learning stages?</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd- A. Sordoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Courville</surname>
          </string-name>
          , Understanding by underGraber,
          <source>N. Okazaki (Eds.)</source>
          ,
          <article-title>Findings of the Associa- standing not: Modeling negation in language modtion for Computational Linguistics: ACL</article-title>
          <year>2023</year>
          ,
          <year>2023</year>
          , els, in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
          </string-name>
          , L. Zettlemoyer, pp.
          <fpage>12205</fpage>
          -
          <lpage>12218</lpage>
          . D.
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
          </string-name>
          , R. Cotterell,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pimentel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.), Proceedings of the R. P.
          <article-title>Levy, Testing the predictions of surprisal the- 2021 Conference of the North American Chapter 1. La bambina sta correndo 'The girl is running' (WRONG) 4. Il bambino ha svuotato il cestino 'The boy has emptied the bin' (WRONG) 4. La ragazza indica ma non corre 'The girl is pointing but not running' (WRONG) 2. La scatola è grande e gialla 'The box is big and yellow' (WRONG)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>